Abstract

In this paper, we propose a scalable and highly efficient index structure for the reachability
problem over graphs. We build on the well-known node interval labeling scheme where
the set of vertices reachable from a particular node is compactly encoded as a collection of 
node identifier ranges. We impose an explicit bound on the size of the index and flexibly 
assign approximate reachability ranges to nodes of the graph such that the number of index 
probes to answer a query is minimized. The resulting tunable index structure generates a 
better range labeling if the space budget is increased, thus providing a direct control over 
the trade-off between index size and query processing performance. By using a fast
recursive querying method in conjunction with our index structure, we show that in practice,
reachability queries can be answered in the order of microseconds on an off-the-shelf
computer, even for massive-scale real-world graphs. Our claims are supported
by an extensive set of experimental results using a multitude of benchmark and real-world 
web-scale graph datasets. 

1 Introduction 

Reachability queries are a fundamental operation in graph mining and algorithmics and ample 
work exists on index support for reachability problems. In this setting, given a directed graph 
and a designated source and target node, the task of the index is to determine whether the graph 
contains a path from the source to the target. 

Computing reachability between nodes is a building block in many kinds of graph analytics, 
for example biological and social network analysis, traffic routing, software analysis, and linked 
data on the web, to name a few. In addition, a fast reachability index can prove useful for 
speeding up the execution of general graph algorithms - such as shortest path and Steiner tree 
computations - via search-space pruning. As an example, Dijkstra's algorithm can be greatly 
sped up by avoiding the expansion of vertices that cannot reach the target node. 

While the reachability problem is a light-weight task in terms of its asymptotic complexity, 
the advent of massive graph structures comprising hundreds of millions of nodes and billions of 
edges can render even simple graph operations computationally challenging. It is thus crucial 
for reachability indices to provide answers in sublinear or ideally near-constant time. Further 
complicating matters, the index structures, which generally reside in main memory, are expected
to satisfy an upper bound on the size. In most scenarios, the available space is scarce, ranging
from little more than enough to store the graph itself to a small multiple of its size. 

Given their wide applicability, reachability problems have been one of the research foci in 
graph processing over recent years. While many proposed index structures can easily handle 
small to medium-size graphs comprising hundreds of thousands of nodes (e.g., [TJ |H [TTJ 
IT2l [T3l [TBI IT7T irsl \19\), massive problem instances still remain a challenge to most of them. 
The only technique that can cope with web-scale graphs while satisfying the requirements of
restricted index size and fast query processing time employs guided online search [21, 22],
leading to an index structure that is competitive in terms of its construction time and storage
space consumption, yet speeds up reachability query answering significantly when compared
to a simple DFS/BFS traversal of the graph. However, it suffers from two major drawbacks.
Firstly, given the demanding constraints on precomputation time, only basic heuristics are used 
during index construction, which in many cases leads to a suboptimal use of the available space. 
Secondly and more importantly, while the majority of reachability queries involving pairs of 
nodes that are not reachable can be efficiently answered, the important class of positive queries 
(i.e., the cases in which the graph actually contains a path from the source to the target) has to be
regarded as a worst-case scenario due to the need for recursive querying. This can severely hurt
the performance of many practical applications where positive queries occur frequently. 

The reachability index structure we propose in this paper - coined FERRARI (for Flexible 
and Efficient Reachability Range Assignment for gRaph Indexing) - overcomes the limitations of 
existing approaches by adaptively compressing the transitive closure during its construction. This 
technique enables the efficient computation of an index geared towards minimizing the expected 
query processing time given a user-specified constraint on the resulting index size. Our proposed 
index supports positive queries efficiently and outperforms GRAIL, the best prior method, on
this class of queries by a large margin, while in the vast majority of our experiments also being 
faster on randomly generated queries. To achieve these performance gains, we adopt the idea of 
representing the transitive closure of the graph by assigning identifiers to individual nodes and 
encoding sets of reachable vertices by intervals, first introduced by Agrawal et al. [1]. Instead
of materializing the full set of identifier ranges at every node, we adaptively merge adjacent 
intervals into fewer yet coarser representations at construction time, whenever a certain space 
budget is exceeded. The result is a collection of exact and approximate intervals that are assigned 
as labels of the nodes in the graph. These labels allow for a guided online search procedure that 
can process positive as well as negative reachability queries significantly faster than previously 
proposed size-constrained index structures. 

The interval assignment underlying our approach is based on the solution of an associated interval 
cover problem. Efficient algorithms for computing such a covering structure together with an 
optimized guided online search facilitate an efficient and flexible reachability index structure. 
In summary, this paper makes the following technical contributions: 

• a space-adaptive index structure for reachability queries based on selective compression of 
the transitive closure using exact and approximate reachability intervals, 

• efficient algorithms for index construction and querying that allow extremely fast query 
processing on web-scale real world graphs, and 

• extensive experiments that demonstrate the superiority of our approach in comparison to 
the best prior method that satisfies index size constraints, GRAIL. 

The remainder of the paper is organized as follows: In Section 2 we introduce the necessary
notation and the basic idea of reachability interval labeling. Afterwards, we give a short
introduction to approximate interval indexing in Section 3, followed by an in-depth treatment of
our proposed index (Section 4). An overview of our query processing algorithm is given in
Section 5, followed by the experimental evaluation and concluding remarks.
2 Preliminaries



In the following, let $G = (V, E)$ denote a directed graph with $n := |V|$ nodes and $m := |E|$
edges. For a node $v$, let

\[ \mathcal{N}^{+}(v) := \{ w \in V \mid (v, w) \in E \} \tag{1} \]

and

\[ \mathcal{N}^{-}(v) := \{ u \in V \mid (u, v) \in E \} \tag{2} \]

denote the sets of nodes with an edge coming from and leading to $v$, respectively.

Given nodes $u, v \in V$, we call $v$ reachable from $u$, written as $u \sim v$, if $E$ contains a directed
path from $u$ to $v$. Further, we denote the set of vertices reachable from $v \in V$ in $G$ by
$\mathcal{R}_G(v) := \{ w \in V \mid v \sim w \}$. We call $\mathcal{R}_G(v)$ the reachable set of $v$ (we drop the
subscript whenever the graph under consideration is clear from the context). A reachability query
with source node $u$ and destination node $v$ of a graph $G$ is expressed as a triple $(G, u, v)$ and
answered with a boolean value by a reachability index.

A pair $(u, v)$ of nodes exhibits strong reachability if $u \sim v$ and $v \sim u$, that is, $u$ and $v$ are
mutually reachable in $G$. Note that strong reachability induces an equivalence relation on the
set of nodes. The equivalence classes of this relation, that is, the maximal subsets $V' \subseteq V$ with
$u \sim v$ for all $u, v \in V'$, are called strongly connected components of $G$. For a node $v \in V$ let $[v]$
denote the strongly connected component that contains $v$.

Condensed Graph. We define the condensed graph of $G$, denoted as $G_C = (V_C, E_C)$, as the
graph obtained after collapsing the maximal strongly connected components into "supernodes",
i.e.,

\[ V_C := \{ [v] \mid v \in V \} \tag{3} \]

and

\[ E_C := \{ ([u], [v]) \mid (u, v) \in E,\ [u] \neq [v] \}. \tag{4} \]

By definition, $G_C$ is a directed acyclic graph (DAG).

It is important to note that the reachability queries $(G, u, v)$ and $(G_C, [u], [v])$ are equivalent.
Thus, the existing index structures (including ours) consider only directed acyclic graphs, that
is, create an index over the condensed graph $G_C$. At query time, the input nodes $u, v$ are then
mapped to their respective strongly connected components, allowing early termination whenever
$[u] = [v]$.

Tree Cover and Graph Augmentation. For a graph G, we define a tree cover of G, denoted 
as T(G), as a directed spanning tree of G. If G contains more than one node with no incoming 
edges, instead of a tree only a spanning forest can be obtained. 

In this case, we augment the graph G by introducing an artificial root node r that is connected 
to every node with no incoming edge: 

\[ G' := \bigl( V \cup \{r\},\ E \cup \{ (r, v) \mid v \in V,\ \mathcal{N}^{-}(v) = \emptyset \} \bigr). \tag{5} \]

Note that this modification has no effect on the reachability relation among the existing nodes 
of G. 

Integer Intervals. For integers $x, y \in \mathbb{N}$, $x \le y$, we use the interval $[x, y]$ to represent the set
$\{x, x + 1, \ldots, y\}$. Let $I = [a, b]$ and $J = [p, q]$ denote integer intervals. We define $|I| := b - a + 1$
to denote the number of elements contained in $I$. Further, we call $J$ subsumed by $I$, written
$J \sqsubseteq I$, if $J$ corresponds to a subinterval of $I$, i.e.,

\[ J \sqsubseteq I \iff a \le p \le q \le b. \tag{6} \]

Further, $J$ is called an extension of $I$, denoted $I \hookrightarrow J$, if the start-point but not the end-point of
$J$ is contained in $I$:

\[ I \hookrightarrow J \iff a \le p \le b < q. \tag{7} \]
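As a small illustration of Equations (6) and (7), the two relations can be written as predicates over closed integer intervals. The following Python sketch is not taken from the paper; the names `Interval`, `is_subsumed_by`, and `is_extension_of` are chosen here purely for illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Closed integer interval [lo, hi] with lo <= hi."""
    lo: int
    hi: int

    def size(self) -> int:
        # |I| := b - a + 1
        return self.hi - self.lo + 1


def is_subsumed_by(j: Interval, i: Interval) -> bool:
    """J is subsumed by I: a <= p <= q <= b (Equation 6)."""
    return i.lo <= j.lo <= j.hi <= i.hi


def is_extension_of(j: Interval, i: Interval) -> bool:
    """J is an extension of I: a <= p <= b < q (Equation 7)."""
    return i.lo <= j.lo <= i.hi < j.hi
```

Both predicates assume well-formed intervals; no validation of `lo <= hi` is performed in this sketch.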

An overview of the symbols used throughout this paper is given in Table 1.
Symbol                    Description

$V$                       set of vertices
$E$                       set of edges
$n = |V|$                 number of vertices
$m = |E|$                 number of edges
$\mathcal{N}^{+/-}(v)$    set of direct successors/predecessors of $v$
$\mathcal{R}_G(v)$        reachable set of node $v$ in $G$
$[v]$                     strongly connected component of $v$
$G_C$                     condensed graph (collapsed SCCs)
$G'$                      augmented graph (virtual root node $r$)
$T(G)$                    tree cover of $G$
$T_v$                     subtree of tree $T$, rooted at $v$
$\pi(v)$                  post-order number of node $v$
$\tau(v)$                 topological order number of node $v$
$I_T(v)$                  reachability interval of $v$ in tree $T$
$\mathcal{I}(v)$          set of reachability intervals of $v$

Table 1: Notation



2.1 Interval Indexing 

In this section, we introduce the concept of node identifier intervals for reachability processing, 
first proposed by Agrawal et al. [1], which provided the basis of many subsequent indexing
approaches, including our own. The key idea is to assign numeric identifiers to the nodes in the 
graph and represent the reachable sets of vertices in a compressed form by means of interval 
representations. This technique is based on the construction of a tree cover of the graph followed 
by post-order labeling of the vertices. 

Let $G'$ denote the augmented input graph as defined above. Further, let $T = (V_T, E_T)$ denote
a tree cover of $G'$. In order to assign node identifiers, the tree is traversed in depth-first manner.
In this setting, a node $v$ is visited after all its children have been visited. The post-order number
$\pi(v)$ corresponds to the position of $v$ in the sequence of visited nodes.

Example. Consider the augmented example graph depicted in Figure 1a with the virtual root
node $r$. In this example, the children of a node are traversed in lexicographical order, leading to
the spanning tree induced by the edges shown in bold in Figure 1b. The first node to be visited
is node $e$, which is assigned post-order number 1. Node $a$ is visited as soon as its children $\{c, d\}$
have been visited. The last visited node is the root $r$.

Tree Indexing. The enabling feature, which makes post-order labeling a common ingredient
in reachability indices, is the resulting identifier locality: For every (complete) subtree of $T$, the
ordered identifiers of the included nodes form a contiguous sequence of integers. The vertex set
of any such subtree can thus be compactly expressed as an integer interval. Let $T_v = (V_{T_v}, E_{T_v})$
denote the subtree of $T$ rooted at node $v$. We have

\[ \{ \pi(w) \mid w \in V_{T_v} \} = \Bigl[ \min_{w \in V_{T_v}} \pi(w),\ \max_{w \in V_{T_v}} \pi(w) \Bigr] = \Bigl[ \min_{w \in V_{T_v}} \pi(w),\ \pi(v) \Bigr]. \tag{8} \]

The above interval is called the tree interval of $v$ and will be denoted by $I_T(v)$ in the remainder
of the text.

[Figure 1: Post-Order Interval Assignment. Panels: (a) Input Graph, (b) Post-Order Labeling, (c) Interval Assignment, (d) Interval Propagation]

Example (cont'd). The subtree rooted at node $a$ in Figure 1b contains the nodes $\{a, c, d, e\}$
with the set of identifiers $\{4, 2, 3, 1\}$. Thus, the nodes reachable from $a$ in $T$ are represented by
the tree interval $[1, 4]$. The final assignment of tree intervals to the nodes is shown in Figure 1c.
The complete reachability information of the spanning tree $T$ is encoded in the collection of tree
intervals. For a pair of nodes $u, v \in V$, there exists a path from $u$ to $v$ in $T$ iff the post-order
number of the target is contained in the tree interval of the source, that is,

\[ u \sim v \text{ in } T \iff \pi(v) \in I_T(u). \tag{9} \]

This reachability index for trees allows for $O(1)$ query processing at a space consumption of
$O(n)$.
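The post-order labeling and the constant-time tree query of Equation (9) can be sketched in a few lines of Python. This is an illustration rather than the authors' implementation. The tree edges below are an assumption, chosen so that the labels agree with the identifiers quoted in the running example (the figure itself is not reproduced here), and the function names are hypothetical.

```python
_DONE = object()  # sentinel marking an exhausted child iterator


def post_order_intervals(children, root):
    """Label a rooted tree in post-order and derive the tree interval of each node.

    children maps a node to the ordered list of its children in the tree.
    Returns (pi, interval): pi[v] is the post-order number of v and
    interval[v] = (lowest id in the subtree of v, pi[v]).
    """
    pi, interval = {}, {}
    next_id = 1
    first_id = {root: next_id}
    stack = [(root, iter(children.get(root, ())))]
    while stack:
        node, remaining = stack[-1]
        child = next(remaining, _DONE)
        if child is not _DONE:
            # The first node finished below `child` receives the current counter value.
            first_id[child] = next_id
            stack.append((child, iter(children.get(child, ()))))
        else:
            stack.pop()
            pi[node] = next_id
            interval[node] = (first_id[node], next_id)
            next_id += 1
    return pi, interval


def reaches_in_tree(pi, interval, u, v):
    """Equation (9): v lies below u in the tree iff pi[v] is in the tree interval of u."""
    low, high = interval[u]
    return low <= pi[v] <= high


# Assumed tree edges (children listed in lexicographical order).
children = {"r": ["a", "b"], "a": ["c", "d"], "c": ["e"]}
pi, interval = post_order_intervals(children, "r")
```

An explicit stack is used instead of recursion so that deep trees do not hit Python's recursion limit.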

Extension to DAGs. While the above technique can be used to easily answer reachability queries
on trees, the case of general DAGs is much more challenging. The reason is that, in general,
the reachable set $\mathcal{R}(v)$ of a vertex $v$ in the DAG is only partly represented by the interval
$I_T(v)$, as the tree interval only accounts for reachability relationships that are preserved in $T$.
Vertices that can only be reached from a node $v$ by traversing one or more non-tree edges have
to be handled separately: instead of merely storing the tree intervals $I_T(v)$, every node $v$ is now
assigned a set of intervals, denoted by $\mathcal{I}(v)$. The purpose of this so-called reachable interval set
is to capture the complete reachability information of a node. The sets $\mathcal{I}(v)$, $v \in V$, are initialized
to contain only the tree interval $I_T(v)$. Then, the vertices are visited in reverse topological order.
For the current vertex $v$ and every incoming edge $(u, v) \in E$, the reachable interval set $\mathcal{I}(v)$
is merged into the set $\mathcal{I}(u)$. The merge operation on the intervals resolves all cases of interval
subsumption and extension exhaustively, eventually ensuring interval disjointness. Because
the vertices are visited in reverse topological order, it is ensured that for every non-tree
edge $(s, t) \in E \setminus E_T$, the reachability intervals in $\mathcal{I}(t)$ will be propagated and merged into the
reachable interval sets of $s$ and all its predecessors. As a result, all reachability relationships are
covered by the resulting intervals.

Example (cont'd). Figure 1c depicts the assignment of tree intervals to the nodes. As described
above, in order to compute the reachable interval sets, the nodes are visited in ascending
order of the post-order values (or, equivalently, in reverse topological order), thus starting at
node $e$. The tree interval $I_T(e) = [1, 1]$ is merged into the set of node $c$, leaving $\mathcal{I}(c) = \{[1, 2]\}$
unchanged due to interval subsumption. Next, $I_T(e)$ is merged at node $d$, resulting in the reachable
interval set $\mathcal{I}(d) = \{[1, 1], [3, 3]\}$. The reverse topological order in which the vertices are
visited ensures that the interval $[1, 1]$ is further propagated to the nodes $b$, $a$, and $r$.
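The propagation step can be sketched as follows. Instead of pushing a node's set to its predecessors, the sketch pulls the sets of all successors into the current node while walking the vertices in reverse topological order, which yields the same result. The edge list is an assumption that is consistent with the numbers in the example (in particular the edge from b to d is a guess, since the figure is not available), and `merge` and `propagate` are illustrative names.

```python
def merge(intervals):
    """Fuse overlapping intervals (subsumption and extension) into disjoint ones."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            # Overlap with the previous interval: keep the larger end-point.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def propagate(succ, tree_interval, topo_order):
    """Compute the reachable interval set of every node of a DAG.

    succ maps a node to its direct successors, tree_interval holds the tree
    interval of each node, and topo_order lists the nodes with sources first.
    """
    sets = {v: [tree_interval[v]] for v in topo_order}
    for v in reversed(topo_order):
        for w in succ.get(v, ()):
            # All successors of v were finalized earlier in this loop.
            sets[v] = merge(sets[v] + sets[w])
    return sets


# Assumed example DAG; tree intervals as stated in the running example.
succ = {"r": ["a", "b"], "a": ["c", "d"], "b": ["d"], "c": ["e"], "d": ["e"]}
tree_interval = {"r": (1, 6), "a": (1, 4), "b": (5, 5),
                 "c": (1, 2), "d": (3, 3), "e": (1, 1)}
interval_sets = propagate(succ, tree_interval, ["r", "a", "b", "c", "d", "e"])
```

Note that this sketch only fuses overlapping intervals, following the subsumption and extension cases of Equations (6) and (7); directly adjacent intervals such as [1, 2] and [3, 3] are kept apart.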

Query Processing. Using the reachable interval sets $\mathcal{I}(v)$, queries on DAGs can be answered
by checking whether the post-order number of the target is contained in one of the intervals
associated with the source:

\[ u \sim v \iff \exists\, [\alpha, \beta] \in \mathcal{I}(u) : \alpha \le \pi(v) \le \beta. \tag{10} \]

Example. Consider again the graph depicted in Figure 1. The reachable vertex set of node
$d$ is given by $\mathcal{I}(d) = \{[1, 1], [3, 3]\}$. This set provides all the necessary information in order to
answer reachability queries involving the source node $d$.

By ordering the intervals contained in a set, reachability queries can now be answered
efficiently in $O(\log n)$ time on DAGs. The resulting index (collection of reachable interval sets)
can be regarded as a materialization of the transitive closure of the graph, rendering this
approach potentially infeasible for large graphs, both in terms of space consumption as well as
computational complexity.
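The logarithmic lookup over an ordered interval set can be illustrated with Python's standard `bisect` module. The class name `IntervalLabel` is an assumption made for this sketch; the intervals are expected to be disjoint, as guaranteed by the merge step described above.

```python
from bisect import bisect_right


class IntervalLabel:
    """Sorted, disjoint closed intervals supporting O(log n) membership tests."""

    def __init__(self, intervals):
        self.intervals = sorted(intervals)
        # Start-points are stored separately so that bisect can search them.
        self.starts = [lo for lo, _ in self.intervals]

    def contains(self, x):
        # Find the last interval whose start-point is <= x, then check its end-point.
        i = bisect_right(self.starts, x) - 1
        return i >= 0 and x <= self.intervals[i][1]


label_d = IntervalLabel([(3, 3), (1, 1)])  # the interval set of node d in the example
```

With the set of node d, the ids 1 and 3 are reported as reachable, while id 2 (which falls into the gap) is not.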



3 Approximate Intervals 

For massive problem instances, indexing approaches that materialize the transitive closure (or
compute a compressed variant without an a priori size restriction) suffer from limited
applicability. For this reason, recent work on reachability query processing over massive graphs includes
a shift towards guided online search procedures. In this setting, every node is assigned a concise
label which, in contrast to the interval sets described in Section 2.1, is restricted by a predefined
size constraint. These labels in general do not allow answering the query after inspection
of just the source node, yet can be used to prune portions of the graph in an online search.

As a basic example, consider a reachability index that labels every node $v \in V$ with its
topological order number $\tau(v)$. While this simple variant of node labeling is obviously not
sufficient to answer a reachability query by means of a single lookup, a graph search procedure
can greatly benefit from the node labels: For a given query $(s, t)$, the online search rooted at
$s$ can terminate the expansion of a branch of the graph whenever for the currently considered
node $v$ it holds that

\[ \tau(v) > \tau(t). \tag{11} \]

This follows from the properties of a topological ordering: every edge leads from a lower to a
higher topological order number, so no path can lead from $v$ to a node with a smaller number.
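The pruning rule of Equation (11) can be plugged into an ordinary depth-first search. The sketch below is illustrative only; the function name, the example DAG (including the assumed edge from b to d), and the particular topological numbering are not taken from the paper.

```python
def reachable_with_topo_pruning(succ, tau, s, t):
    """Depth-first search from s that skips every node v with tau[v] > tau[t]."""
    if s == t:
        return True
    if tau[s] > tau[t]:
        return False  # s comes after t in the topological order
    visited = {s}
    stack = [s]
    while stack:
        v = stack.pop()
        for w in succ.get(v, ()):
            if w == t:
                return True
            if w in visited or tau[w] > tau[t]:
                continue  # already expanded, or pruned by Equation (11)
            visited.add(w)
            stack.append(w)
    return False


# Assumed example DAG with one valid topological numbering.
succ = {"r": ["a", "b"], "a": ["c", "d"], "b": ["d"], "c": ["e"], "d": ["e"]}
tau = {"r": 1, "a": 2, "b": 3, "c": 4, "d": 5, "e": 6}
```

For the query from c to d, the only successor of c is e, which is pruned because its topological number exceeds that of d, so the search ends after a single expansion.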

The recently proposed GRAIL reachability index [21, 22] further extends this idea by labeling
the vertices with approximate intervals:

Suppose that for every node $v$ we replace the set $\mathcal{I}(v)$ by a single interval

\[ I'(v) := \Bigl[ \min_{w \in \mathcal{R}(v)} \pi(w),\ \max_{w \in \mathcal{R}(v)} \pi(w) \Bigr] \tag{12} \]

spanning from the lowest to the highest reachable id. This interval is approximate in the sense
that all reachable ids are covered whereas false positive entries are possible:

Definition 1 (False Positive). Let $v \in V$ denote a node with the approximate interval
$I'(v) = [\alpha, \beta]$. A vertex $w \in V$ is called false positive with respect to $I'(v)$ if

\[ \alpha \le \pi(w) \le \beta \quad\text{and}\quad v \not\sim w. \tag{13} \]

Obviously, the single interval $I'(v)$ is not sufficient to establish a definite answer to a
reachability query of the form $(G, v, w)$. However, all queries involving a target id $\pi(w)$ that lies
outside the interval, i.e.,

\[ \pi(w) < \alpha \quad\text{or}\quad \pi(w) > \beta, \tag{14} \]

can be answered instantly with a negative answer, similar to the basic approach based on
Equation (11). In the opposite case, that is,

\[ \alpha \le \pi(w) \le \beta, \tag{15} \]

no definite answer to the reachability query can be given and the online search procedure
continues with an expansion of the child vertices, terminating as soon as the target node is encountered
or all branches have been expanded or pruned, respectively.
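To make Equation (12) and Definition 1 concrete, the following sketch collapses an exact interval set into the single approximate interval and lists the ids that become false positives. The helper names are hypothetical, and the input is assumed to be a non-empty list of exact intervals.

```python
def approximate_interval(exact_intervals):
    """Single interval spanning from the lowest to the highest reachable id."""
    low = min(lo for lo, _ in exact_intervals)
    high = max(hi for _, hi in exact_intervals)
    return (low, high)


def false_positive_ids(exact_intervals):
    """Ids inside the approximate interval that no exact interval covers."""
    low, high = approximate_interval(exact_intervals)
    covered = {x for lo, hi in exact_intervals for x in range(lo, hi + 1)}
    return [x for x in range(low, high + 1) if x not in covered]
```

For node d of the running example, whose exact set is {[1, 1], [3, 3]}, the approximate interval is [1, 3], and id 2 becomes a false positive: a query for the node with post-order number 2 can no longer be answered at d alone.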
In practical applications the GRAIL index assigns a number of $k \ge 1$ such approximate
intervals to every vertex, each based on a different (random) spanning tree of the graph. The
intuition behind this labeling is that an ensemble of independently generated intervals improves
the effectiveness of the node labels since each additional interval potentially reduces the
remaining false positive entries.

The advantage of this indexing approach over a materialization of the transitive closure is
obvious: the size of the resulting labels can be determined a priori by an appropriate selection of the
number ($k$) of intervals assigned to each node. In addition, the node labels are easily computed
by means of $k$ DFS traversals of the graph.

Empirically, GRAIL has been shown to greatly improve the query processing time over online
DFS search in many cases. However, especially in the case of positive queries, a large portion
of the graph still has to be expanded. While extensions to GRAIL have been proposed to
improve performance on positive queries [22], the processing time in these cases remains high.
Furthermore, while an increase of the number of intervals assigned to the nodes potentially
reduces false positive elements, no guarantee can be made due to the heuristic nature of the
underlying algorithm. As a result, in many cases superfluous intervals are stored, in some cases
negatively impacting query processing time.

4 The Ferrari Reachability Index 

In this section, we present the FERRARI reachability index which enables fast query processing 
performance over massive graphs by a more involved node labeling approach. The main goal 
of our index is the assignment of a mixture of exact and approximate reachability intervals 
to the vertices with the goal of minimizing the expected query processing time, given a user- 
specified size constraint on the index. Contrasting previously proposed approaches, we show 
both theoretically and empirically that the interval assignment of the FERRARI index utilizes 
the available space for maximum effectiveness of the node labels. 

Similar to previously proposed index structures [TJ [TS1 [2TJ [52] , we use intervals to encode 
reachability relationships of the vertices. However, in contrast to existing approaches, FERRARI
can be regarded as an adaptive transitive closure compression algorithm. More precisely,
FERRARI uses selective interval set compression, where a subset of adjacent intervals in an 
interval set is merged into a smaller number of approximate intervals. The resulting node label 
then retains a high pruning effectiveness under a given size-restriction. 

Before we delve into the details of our algorithms and the corresponding query processing procedure,
we first introduce the basic concepts that facilitate our interval assignment approach.

The FERRARI index distinguishes between two types of intervals: approximate (similar to
the intervals in Section 3) and exact (as in Section 2.1), depending on whether they contain false
positive elements or not.

Let $I$ denote an interval. To easily distinguish between interval types, we introduce an
indicator variable $\eta_I$ such that

\[ \eta_I := \begin{cases} 0 & \text{if } I \text{ approximate,} \\ 1 & \text{if } I \text{ exact.} \end{cases} \tag{16} \]

As outlined above, a main characteristic of FERRARI is the assignment of size-restricted
interval sets comprising approximate and exact intervals as node labels. Before we introduce
the algorithmic steps that facilitate the index construction, it is important to explain how
reachability queries can be answered using the proposed interval sets. Let $(G, s, t)$ denote a
reachability query and $\mathcal{I}(s) = \{I_1, I_2, \ldots, I_N\}$ the set of intervals associated with node $s$. In
order to determine whether $t$ is reachable from node $s$, we have to check whether the post-order
identifier $\pi(t)$ of $t$ is included in one of the intervals in the set $\mathcal{I}(s)$. If $\pi(t)$ lies outside of
all intervals $I_1, \ldots, I_N$, the query terminates with a negative answer. If however it holds that
$\pi(t) \in I_i$ for one $I_i \in \mathcal{I}(s)$, we have to distinguish two cases: (i) if $I_i$ is exact then $s$ is guaranteed
to reach node $t$ and (ii) if $I_i$ is approximate, the neighbors of node $s$ have to be queried recursively
until a definite answer can be given. Obviously, recursive expansions are costly and it is thus
desirable to minimize the number of cases that require lookups beyond the source node.
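The three outcomes of a label lookup (miss, exact hit, approximate hit) translate into a guided search as sketched below. This is a simplified illustration, not the optimized query procedure of the paper: labels are scanned linearly, an explicit stack replaces the recursion, and the example labels, the edge list, and the function name are assumptions.

```python
def ferrari_style_query(succ, labels, pi, s, t):
    """Answer 'does s reach t?' using labels of (lo, hi, exact) entries."""
    target = pi[t]
    visited = set()
    stack = [s]
    while stack:
        v = stack.pop()
        if v == t:
            return True
        if v in visited:
            continue
        visited.add(v)
        # Intervals of a label are disjoint, so at most one of them can contain the target.
        hit = next((entry for entry in labels[v] if entry[0] <= target <= entry[1]), None)
        if hit is None:
            continue          # target id outside all intervals: prune this branch
        if hit[2]:
            return True       # exact interval: v is guaranteed to reach t
        stack.extend(succ.get(v, ()))  # approximate interval: expand the successors
    return False


# Assumed example: node b holds one approximate interval [1, 3] and one exact interval [5, 5].
succ = {"r": ["a", "b"], "a": ["c", "d"], "b": ["d"], "c": ["e"], "d": ["e"]}
pi = {"e": 1, "c": 2, "d": 3, "a": 4, "b": 5, "r": 6}
labels = {
    "r": [(1, 6, True)],
    "a": [(1, 4, True)],
    "b": [(1, 3, False), (5, 5, True)],
    "c": [(1, 2, True)],
    "d": [(1, 1, True), (3, 3, True)],
    "e": [(1, 1, True)],
}
```

In this example the query from b to c hits the approximate interval at b, expands d, and is pruned there, giving a negative answer after one extra lookup; the query from a to e is answered at the source by an exact interval.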

To formally introduce the corresponding optimization problem, we define the notion of interval
covers:

Definition 2 ($k$-interval Cover). Let $k \ge 1$ denote an integer and
$\mathcal{I} = \{[\alpha_1, \beta_1], \ldots, [\alpha_N, \beta_N]\}$ a set of intervals. A set
$\mathcal{C} = \{[\alpha'_1, \beta'_1], \ldots, [\alpha'_l, \beta'_l]\}$ is called $k$-interval cover of $\mathcal{I}$, written as
$\mathcal{C} \sqsupseteq_k \mathcal{I}$, if $\mathcal{C}$ covers all elements from $\mathcal{I}$ using no more than $k$ intervals, i.e.,

\[ \bigcup_{i=1}^{N} \{ j \mid \alpha_i \le j \le \beta_i \} \subseteq \bigcup_{i=1}^{l} \{ j' \mid \alpha'_i \le j' \le \beta'_i \} \tag{17} \]

with

\[ l \le k. \tag{18} \]

Note that an interval cover of a set of intervals is easily obtained by merging an arbitrary
number of adjacent intervals in the input set. Next, we address the problem of choosing a
$k$-interval cover that maximizes the pruning effectiveness.

Definition 3 (Optimal $k$-interval Cover). Let $k \ge 1$ denote an integer and
$\mathcal{I} = \{I_1, I_2, \ldots, I_N\}$ an interval set of size $N$. We define the optimal $k$-interval cover of $\mathcal{I}$ by

\[ \mathcal{I}^{*}_{k} := \operatorname*{argmin}_{\mathcal{C} \sqsupseteq_k \mathcal{I}} \sum_{I \in \mathcal{C}} (1 - \eta_I)\, |I|, \tag{19} \]

that is, the cover of $\mathcal{I}$ with no more than $k$ intervals and the minimum number of elements in
approximate intervals.

Note that by replacing the set of exact reachability intervals $\mathcal{I}(v)$ by its optimal $k$-interval
cover $\mathcal{I}^{*}_{k}(v)$, which is then used as the node label in our index, we retain maximal effectiveness
for terminating a query. The reason is that the number of cases that require recursive querying
directly corresponds to the number of elements contained in approximate intervals.



4.1 Computing the Optimal Interval Cover 

While the special cases $k = N$ (the optimal $k$-interval cover of $\mathcal{I}$ is the set $\mathcal{I}$ itself) and $k = 1$
(the optimal solution corresponds to the single approximate interval assigned by GRAIL, see
Equation (12)) are easily solved, we next introduce an algorithm that solves the problem for
general values of $k$.

As hinted above, an interval cover can be computed by selectively merging adjacent intervals
from the original assignment made to the node $v$. In order to derive an algorithm for computing
$\mathcal{I}^{*}_{k}(v)$, we first transform the interval set at the node $v$ into its dual representation where the
gaps between intervals are specified. As usual, let $\mathcal{I} = \{I_1, I_2, \ldots, I_N\}$ with $I_i = [\alpha_i, \beta_i]$.
The set $\Gamma := \{\gamma_1, \gamma_2, \ldots, \gamma_{N-1}\}$, $\gamma_i = [\beta_i + 1, \alpha_{i+1} - 1]$, denotes the gaps between the intervals
contained in $\mathcal{I}$:

[Illustration: four intervals $I_1, \ldots, I_4$ laid out on the number line, separated by the gaps $\gamma_1$, $\gamma_2$, $\gamma_3$]
Note that the gap set $\Gamma$ together with the boundary elements $\alpha_1, \beta_N$ is an equivalent
representation of $\mathcal{I}$. For a subset $G \subseteq \Gamma$ we denote by $\zeta(G)$ the induced interval set obtained by
merging adjacent intervals $I_i, I_{i+1}$ if for their mutually adjacent gap $\gamma_i$ it holds $\gamma_i \notin G$. As an
illustrative example, for the interval set depicted above we have
$\zeta(\{\gamma_2\}) = \{[\alpha_1, \beta_2], [\alpha_3, \beta_4]\}$.

Every induced interval set $\zeta(G)$ actually corresponds to a $(|G| + 1)$-interval cover of the original
set $\mathcal{I}$. It is easy to see that the optimal $k$-interval cover can equivalently be specified by a subset
of gaps.

In order to compute the optimal $k$-interval cover, we thus transform the problem defined in
Equation (19) into the equivalent problem of selecting the "best" $k - 1$ gaps from the original
gap set $\Gamma$ (or, equivalently, determining the $N - k$ gaps that are not included in the solution). For
a potential solution $G \subseteq \Gamma$ of at most $k - 1$ gaps to preserve, we can assess its cost, measured by
the number of elements in the induced interval cover that are contained in approximate intervals:

\[ c(G) := \sum_{I \in \zeta(G)} (1 - \eta_I)\, |I|, \tag{20} \]

where for $I \in \zeta(G)$ it holds

\[ \eta_I = \begin{cases} 1 & \text{if } I \in \mathcal{I}, \\ 0 & \text{otherwise.} \end{cases} \tag{21} \]

Clearly, our goal is to determine the set $\Gamma^{*}_{k-1} \subseteq \Gamma$ such that

\[ \Gamma^{*}_{k-1} := \operatorname*{argmin}_{G \subseteq \Gamma,\ |G| \le k - 1} c(G). \tag{22} \]

We now present a dynamic programming approach to obtain the optimal set of $k - 1$ gaps.
In the following, we denote for a sequence of intervals $\mathcal{I} = (I_1, I_2, \ldots, I_N)$ the subsequence
consisting of the first $j$ intervals by $\mathcal{I}_j := (I_1, I_2, \ldots, I_j)$. Now, observe that every set of
gaps $G \subseteq \Gamma$, $|G| \le k - 1$, represents a valid $k$-interval cover for each of the interval sequences
$\mathcal{I}_{\max\{i \mid \gamma_i \in G\}}, \ldots, \mathcal{I}_N$, yet at different costs (the cost corresponding to each of these coverings
is monotonically non-decreasing). In order to obtain an optimal substructure formulation, consider the
problem of computing the optimal $k$-interval cover for the interval sequence $\mathcal{I}_j$. The possible
interval covers can be represented by a collection of sets of gaps:

\[ \mathcal{G}_{k-1}(\mathcal{I}_j) := \{ G \subseteq \{\gamma_1, \gamma_2, \ldots, \gamma_{j-1}\} \mid |G| \le k - 1 \} = \mathcal{G}^{-}_{k-1}(\mathcal{I}_j) \cup \mathcal{G}^{+}_{k-1}(\mathcal{I}_j) \tag{23} \]

with

\[ \mathcal{G}^{-}_{k-1}(\mathcal{I}_j) := \mathcal{G}_{k-1}(\mathcal{I}_{j-1}) \quad\text{and}\quad \mathcal{G}^{+}_{k-1}(\mathcal{I}_j) := \{ G \cup \{\gamma_{j-1}\} \mid G \in \mathcal{G}_{k-2}(\mathcal{I}_{j-1}) \}, \tag{24} \]

that is, $\mathcal{G}_{k-1}(\mathcal{I}_j)$ is the collection of all subsets of $\{\gamma_1, \ldots, \gamma_{j-1}\}$ comprising not more than $k - 1$
elements and $\mathcal{G}^{+}_{k-1}(\mathcal{I}_j)$ corresponds to the collection of all sets of gaps including $\gamma_{j-1}$ and not
more than $k - 2$ elements from $\{\gamma_1, \gamma_2, \ldots, \gamma_{j-2}\}$.

From Equations (23) and (24) we can deduce that every $k$-interval cover of $\mathcal{I}_j$, and thus the optimal
solution, is either

• a $k$-interval cover of $\mathcal{I}_{j-1}$ or

• a $(k - 1)$-interval cover of $\mathcal{I}_{j-1}$ combined with the gap $\gamma_{j-1}$ between the last two intervals,
$I_{j-1}$ and $I_j$.
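Thus, for the optimal solution $\Gamma^{*}_{k-1}(\mathcal{I}_j)$ we have

\[
\begin{aligned}
c\bigl(\Gamma^{*}_{k-1}(\mathcal{I}_j)\bigr)
&= \min\Bigl\{ \min_{G \in \mathcal{G}^{+}_{k-1}(\mathcal{I}_j)} c(G),\ \min_{G \in \mathcal{G}^{-}_{k-1}(\mathcal{I}_j)} c(G) \Bigr\} \\
&= \min\Bigl\{ \min_{G \in \mathcal{G}_{k-2}(\mathcal{I}_{j-1})} c(G),\ \min_{G \in \mathcal{G}_{k-1}(\mathcal{I}_{j-1})} c(G) + |\gamma_{j-1}| + |I_j| \Bigr\} \\
&= \min\Bigl\{ c\bigl(\Gamma^{*}_{k-2}(\mathcal{I}_{j-1})\bigr),\ c\bigl(\Gamma^{*}_{k-1}(\mathcal{I}_{j-1})\bigr) + |\gamma_{j-1}| + |I_j| \Bigr\}.
\end{aligned} \tag{25}
\]

We can exploit the optimal substructure derived in Equation (25) for the desired dynamic
programming procedure: For each $1 \le i \le N$ we have to compute the $k'$-interval cover for
$k - N + i \le k' \le k$, thus obtaining the optimal solution in time $O(kN)$.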
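A dynamic program with the same $O(kN)$ running time can be sketched in Python as follows. It is a reformulation rather than a transcription of Equation (25): the table additionally tracks whether the last cover interval is still a single exact input interval, so that the size of that interval is charged once it gets merged. The function name is hypothetical, the input is assumed to be a sorted list of disjoint exact intervals, and only the optimal cost (not the cover itself) is returned.

```python
import math


def optimal_cover_cost(intervals, k):
    """Minimum number of ids inside approximate intervals over all covers with <= k intervals."""
    n = len(intervals)
    if n == 0:
        return 0
    size = [hi - lo + 1 for lo, hi in intervals]
    gap = [intervals[j + 1][0] - intervals[j][1] - 1 for j in range(n - 1)]
    inf = math.inf
    # single[g] / merged[g]: best cost for the processed prefix using exactly g cover
    # intervals, where the last one is one input interval (exact) or a merged one.
    single = [inf] * (k + 1)
    merged = [inf] * (k + 1)
    single[1] = 0
    for j in range(1, n):
        new_single = [inf] * (k + 1)
        new_merged = [inf] * (k + 1)
        grow = gap[j - 1] + size[j]  # ids added when interval j joins the last group
        for g in range(1, k + 1):
            # Keep the gap before interval j: it starts a new, exact cover interval.
            new_single[g] = min(single[g - 1], merged[g - 1])
            # Drop the gap: interval j is merged into the last cover interval.
            new_merged[g] = min(single[g] + size[j - 1] + grow, merged[g] + grow)
        single, merged = new_single, new_merged
    return min(min(single[1:]), min(merged[1:]))
```

For the interval set {[1, 2], [4, 4], [10, 12]} and k = 2, merging the first two intervals yields the approximate interval [1, 4] with four elements, whereas merging the last two would yield [4, 12] with nine elements, so the optimal cost is 4.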

In some practical applications the amount of computation can become prohibitive, as one
instance of the problem has to be solved for every node in the graph. Thus, in our implementation,
we use a simple and fast greedy algorithm that, starting from the empty set, iteratively adds
the gap $\gamma \in \Gamma$ that leads to the greatest reduction in cost given the current selection $G$, until
$k - 1$ gaps have been selected (or no gap remains), and then computes the interval cover from $\zeta(G)$. While the
gain in speed comes at the cost of a potentially suboptimal cover, our experimental evaluation
demonstrates that this approach works well in practice.
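The greedy selection can be sketched directly from its description. The version below favors clarity over speed, since it recomputes the cost of the induced cover for every candidate gap; the function names and the 0-based gap indices are assumptions of this sketch, and the input is expected to be a non-empty sorted list of disjoint exact intervals.

```python
def induced_cover(intervals, kept_gaps):
    """Merge I_i and I_{i+1} unless the gap between them (0-based index i) is kept.

    Returns a list of (lo, hi, exact) entries; merged entries are approximate.
    """
    cover = []
    lo, hi = intervals[0]
    exact = True
    for i in range(1, len(intervals)):
        if (i - 1) in kept_gaps:
            cover.append((lo, hi, exact))
            lo, hi = intervals[i]
            exact = True
        else:
            hi = intervals[i][1]
            exact = False
    cover.append((lo, hi, exact))
    return cover


def cover_cost(cover):
    """Number of ids contained in approximate intervals, as in Equation (20)."""
    return sum(hi - lo + 1 for lo, hi, exact in cover if not exact)


def greedy_cover(intervals, k):
    """Keep up to k - 1 gaps, always adding the gap that lowers the cost the most."""
    kept = set()
    candidates = set(range(len(intervals) - 1))
    while len(kept) < k - 1 and candidates:
        best = min(candidates,
                   key=lambda g: (cover_cost(induced_cover(intervals, kept | {g})), g))
        kept.add(best)
        candidates.remove(best)
    return induced_cover(intervals, kept)
```

On the input {[1, 2], [4, 4], [10, 12]} with k = 2, keeping the gap between [4, 4] and [10, 12] reduces the cost from 12 to 4, so the result consists of the approximate interval [1, 4] and the exact interval [10, 12].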

In the next section, we explain how the above node labeling technique is used as a building
block during the reachability index computation.

4.2 Index Construction 

At precomputation time, the user specifies a certain budget $B = kn$, $k \ge 1$, of intervals that can be
assigned to the nodes, thus directly controlling the tradeoff between index size/precomputation
time and pruning effectiveness of the respective node labels. The subsequent index construction
procedure can be broken down into the following main stages:

4.2.1 Tree Cover Construction 

Agrawal et al. [1] propose an algorithm for computing a tree cover that leads to the minimum
number of exact intervals to store. This tree can be computed in $O(mn)$ time, rendering the
approach infeasible for the case of massive graphs. While, in principle, heuristics could be used
that are based on centrality measures or estimates of the sizes of reachable sets [15], we settle
for a simpler solution that does not yield an approximation guarantee yet performs well
in practice. We argue that a good tree cover should cover as many reachability relationships
as possible in the initial tree intervals (see Equation (8)). Therefore, every edge included in the
tree should provide a connection between as many pairs of nodes as possible. To this end, we
propose the following procedure to heuristically construct such a tree cover $T$:
Let t : V — > {1,2, ... ,n} denote a topological ordering of the vertices, i. e. for all (u, v) € E 
it holds t(u) < t(v). Such a topological ordering is easily obtained by the classical textbook 
algorithm [7] in time 0(m + n). We interpret the topological order number of a vertex as the 
number of potential predecessors in the graph, because the number of predecessors of a given 
node is upper bounded by its position in the topological ordering. For a vertex v with set of 
predecessors Af~(v), we select the edge from node p E Af~(v) with highest topological order 
number for inclusion in the tree, that is 



The intuition is that node p has the highest number of potential predecessors and thus the 
selected edge (p, v) has the potential of providing a connection from a large number of nodes to 



p := argmax r(it). 



(26) 
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Algorithm 1: TreeCover(G) 

Input: directed acyclic graph G = (V,E) 

1 begin 

2 T <- (Vt,Et) <- (V, 0) 

3 t 4 — TopologicalSort(G) 

4 for i = n downto 1 do 

5 {v} <- {u e V | t(u) = i] 



8 return T 



v, eventually leading to a high number of reachability relationships encoded in the resulting tree 
intervals. An overview of the tree cover construction algorithm is depicted in Algorithm [TJ 
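The tree cover heuristic translates almost directly into Python. The sketch below is illustrative only: the graph representation (nodes numbered 0..n-1, an edge list) and the function names are assumptions made for this example, and Kahn's algorithm stands in for the textbook topological sort.

```python
from collections import deque


def topological_numbers(n, edges):
    """Kahn's algorithm; returns tau with tau[v] in 1..n for every node v."""
    succ = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in range(n) if indeg[v] == 0)
    tau = [0] * n
    next_number = 1
    while queue:
        u = queue.popleft()
        tau[u] = next_number
        next_number += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if next_number != n + 1:
        raise ValueError("graph contains a cycle")
    return tau


def tree_cover(n, edges):
    """Pick, for every node, the predecessor with the highest topological number.

    Returns (parent, tau); parent[v] is None for nodes without predecessors.
    """
    tau = topological_numbers(n, edges)
    parent = [None] * n
    for u, v in edges:
        if parent[v] is None or tau[u] > tau[parent[v]]:
            parent[v] = u
    return parent, tau
```

Since every node keeps at most one incoming edge, the result is a forest; a virtual root can be added if a single tree is required.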

4.2.2 Interval Set Assignment 

As the next step, given the tree cover T, indexing proceeds by assigning to each node v the exact
tree interval I_T(v), which encodes the reachability relationships within the tree. This interval
assignment can be obtained using a single depth-first traversal of T.
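A single traversal suffices because a post-order numbering gives every subtree a contiguous range of ids. The following sketch assumes the common convention that the id π(v) of a node is its post-order number and that I_T(v) = [smallest id in the subtree of v, π(v)]; this convention and the function name are assumptions for illustration. The traversal is iterative to avoid recursion limits on deep trees.

```python
def assign_tree_intervals(parent):
    """Post-order interval labeling of a forest given as a parent array.

    Returns interval[v] = (low, post) where post is the post-order id of v and
    low is the smallest post-order id inside the subtree rooted at v.
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    roots = []
    for v, p in enumerate(parent):
        if p is None:
            roots.append(v)
        else:
            children[p].append(v)
    interval = [None] * n
    counter = 0  # last post-order id handed out so far
    for root in roots:
        stack = [(root, counter + 1, iter(children[root]))]
        while stack:
            v, low, it = stack[-1]
            child = next(it, None)
            if child is not None:
                # The first id assigned below `child` will be counter + 1.
                stack.append((child, counter + 1, iter(children[child])))
            else:
                stack.pop()
                counter += 1
                interval[v] = (low, counter)
    return interval
```

Under this convention, u reaches v within the tree exactly when the interval of u contains the post-order id of v.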

In order to label every node v with a k-interval cover I'(v) of its true reachable interval set, we
visit the vertices of the graph in reverse topological order, that is, starting from the leaf with
the highest topological order number and proceeding iteratively backwards to the root node. We
initialize the reachable interval set of node v as I'(v) := {I_T(v)}. For the currently visited node v
and every edge (v, w) ∈ E, we merge I'(w) into I'(v), such that the resulting set of intervals is
closed under subsumption and extension.*

Then, in order to satisfy the size restriction of at most k intervals associated with a node,
we replace I'(v) by its k-interval cover, which is then stored as the node label of v in our index.
The complete procedure is shown in detail in Algorithm 2. It is easy to see that the resulting
index, consisting of the sets of approximate and exact intervals I'(v), v ∈ V, comprises at most
nk = B intervals. The upper bound Σ_{v ∈ V} |I'(v)| ≤ B is usually not tight, i.e. in practice,
far fewer than B intervals are assigned. As an example, in the case of leaf nodes or the root
vertex, a single interval suffices. The name of the algorithm, FERRARI-L, thus reflects the
fact that a local size restriction, |I'(v)| ≤ k, is satisfied by every interval set I'(v).

Note that even though an optimal algorithm can be used to compute the k-interval covers, the
optimality of the local result does in general not extend to the global solution, i.e. the full set
of node labels. The reason is that adjacent intervals that are merged during the interval cover
computation are propagated to the parent nodes. As a result, at the point during the execution of
the algorithm where the interval set of the parent p has to be covered, the k-interval cover is
computed without knowledge of the true (exact) reachability intervals of p. More precisely, the
input to the covering algorithm is a combination of approximate (thus previously merged) and exact
intervals. Nevertheless, the resulting node labels prove very effective for early termination of
reachability queries, as our experimental evaluation indicates.
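The local labeling procedure described above can be sketched as follows. This is a simplified illustration rather than the actual implementation: intervals are (lo, hi, exact) triples, the merge rule is a conservative assumption (a merged interval is exact only if it is the union of exact intervals or an exact interval that subsumes the others), and the cover is computed greedily by keeping the widest gaps. All function names and the input format (successor lists, topological numbers, tree intervals) are hypothetical.

```python
def merge(a, b):
    """Union of two interval sets, closed under subsumption and extension."""
    out = []
    # Sort by start; for equal starts, longer and exact intervals come first.
    for lo, hi, exact in sorted(a + b, key=lambda iv: (iv[0], -iv[1], not iv[2])):
        if out and lo <= out[-1][1] + 1:           # overlapping or adjacent
            prev_lo, prev_hi, prev_exact = out[-1]
            if hi > prev_hi:                        # extension
                out[-1] = (prev_lo, hi, prev_exact and exact)
            # otherwise the interval is subsumed and dropped
        else:
            out.append((lo, hi, exact))
    return out


def k_interval_cover(intervals, k):
    """Greedy cover: keep the k - 1 widest gaps open, close all others."""
    gaps = sorted(((intervals[i + 1][0] - intervals[i][1], i)
                   for i in range(len(intervals) - 1)), reverse=True)
    kept = {i for _, i in gaps[:k - 1]}
    out = [intervals[0]]
    for i in range(1, len(intervals)):
        if i - 1 in kept:
            out.append(intervals[i])
        else:
            # Closing a gap covers unreachable ids, so the result is approximate.
            out[-1] = (out[-1][0], intervals[i][1], False)
    return out


def ferrari_l(succ, tau, tree_interval, k):
    """Assign at most k intervals to every node (local size restriction)."""
    n = len(succ)
    labels = [None] * n
    # Reverse topological order: all successors are labeled before v.
    for v in sorted(range(n), key=lambda x: -tau[x]):
        lo, hi = tree_interval[v]
        label = [(lo, hi, True)]
        for w in succ[v]:
            label = merge(label, labels[w])
        labels[v] = k_interval_cover(label, k)
    return labels
```

In the small example used for testing, a node with reachable ids {1, 3} receives the single approximate interval [1, 3] for k = 1 and two exact intervals for k = 2.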

To further improve our reachability index, in the next section we propose a variant of the
labeling algorithm that leads to an even better utilization of the available space.

* In our implementation we require non-adjacent intervals in the set, that is, for [α1, β1], [α2, β2] ∈ I it must
hold β1 < α2. When sets of approximate and exact intervals are merged, the type of the resulting interval
depends on several factors. For example, when an exact interval is extended by an approximate interval,
the result is one long approximate range.

Algorithm 2: Ferrari-L(G, B)

Input: directed acyclic graph G, interval budget B = kn
Result: set of at most k approximate and exact reachability intervals I'(v) for every
        node v ∈ V

begin
    T ← TreeCover(G)
    I_T ← AssignTreeIntervals(T)
    k ← B / n
    for i = n to 1 do                          ▷ visit nodes in reverse topological order
        {v} ← {u ∈ V | τ(u) = i}
        I'(v) ← {I_T(v)}
        foreach w ∈ N⁺(v) do
            I'(v) ← I'(v) ⊕ I'(w)              ▷ merge interval sets
        I'(v) ← k-IntervalCover(I'(v))         ▷ replace intervals by k-interval cover
    return {I'(v) | v ∈ V}



4.3 Dynamic Budget Allocation 

As mentioned above, the interval assignment described in Algorithm 2 usually leads to a total
number of far fewer than B intervals stored at the nodes. In order to better exploit the available
space, we extend our algorithm by introducing the concept of deferred interval merging, where
we can assign more than k intervals to the nodes on the first visit, potentially requiring us to
revisit a node at a later stage of the algorithm.

The indexing algorithm for this interval assignment variant works as follows: Similar to
FERRARI-L, nodes are visited in reverse topological order and the interval sets of the neighboring
nodes are merged into the interval set I'(v) for the current vertex v. However, in this new
variant, subsequent to merging the interval sets we compute the interval cover comprising at most
ck intervals, given a constant c > 1. This way, more intervals can be stored in the node labels.
After the ck-interval cover has been computed, the vertex v is added to a min-heap structure
where the nodes are maintained in ascending order of their degree. This procedure continues
until the already assigned interval sets sum up to a size of more than B intervals. In this case,
the algorithm repeatedly pops the minimum element from the heap and restricts its respective
interval set by computing the k-interval cover. This deferred interval set restriction is repeated
until the number of assigned intervals again satisfies the size constraint B.

The above procedure leads to a much better utilization of the available space and thus a better
quality of the resulting reachability index. The improvement comes at the cost of increased index
computation time: in practice, the increase is two-fold in the worst case and negligible in others.
Our experimental evaluation suggests that a value of c = 4 provides a reasonable tradeoff between
efficiency of construction and resulting index quality. This second indexing variant is shown in
detail in Algorithm 3. We refer to the algorithm as the global variant (FERRARI-G) as in this
case the size constraint is satisfied over all vertices, in contrast to the local size constraint of
FERRARI-L.
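The deferred restriction can be illustrated with Python's heapq module. The sketch below is a simplified model under stated assumptions: intervals are plain (lo, hi) pairs without the exact/approximate flag, the cover is the greedy widest-gap cover, and the heap is keyed by out-degree. The helper names merge, cover and ferrari_g are hypothetical.

```python
import heapq


def merge(a, b):
    """Union of two sets of (lo, hi) intervals; adjacent intervals are fused."""
    out = []
    for lo, hi in sorted(a + b):
        if out and lo <= out[-1][1] + 1:
            out[-1] = (out[-1][0], max(out[-1][1], hi))
        else:
            out.append((lo, hi))
    return out


def cover(intervals, k):
    """Greedy cover: keep the k - 1 widest gaps open, close the rest."""
    gaps = sorted(((intervals[i + 1][0] - intervals[i][1], i)
                   for i in range(len(intervals) - 1)), reverse=True)
    kept = {i for _, i in gaps[:k - 1]}
    out = [intervals[0]]
    for i in range(1, len(intervals)):
        if i - 1 in kept:
            out.append(intervals[i])
        else:
            out[-1] = (out[-1][0], intervals[i][1])
    return out


def ferrari_g(succ, tau, tree_interval, k, c):
    """Global budget B = k * n; nodes may temporarily hold up to c * k intervals."""
    n = len(succ)
    budget = k * n
    labels = [None] * n
    heap = []      # entries (out-degree, node) for nodes holding more than k intervals
    assigned = 0   # number of currently assigned intervals
    for v in sorted(range(n), key=lambda x: -tau[x]):  # reverse topological order
        label = [tree_interval[v]]
        for w in succ[v]:
            label = merge(label, labels[w])
        labels[v] = cover(label, c * k)
        assigned += len(labels[v])
        if len(labels[v]) > k:
            heapq.heappush(heap, (len(succ[v]), v))
        # Deferred restriction: shrink low-degree nodes until the budget holds.
        while assigned > budget and heap:
            _, w = heapq.heappop(heap)
            assigned -= len(labels[w])
            labels[w] = cover(labels[w], k)
            assigned += len(labels[w])
    return labels
```

Once the heap is empty every node holds at most k intervals, so the total can never exceed B = kn when the function returns.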

Algorithm 3: Ferrari-G(G, B)

Input: directed acyclic graph G, interval budget B = kn, constant c > 1
Result: set of approximate and exact reachability intervals I'(v) for every node v ∈ V
        s.t. the total number of intervals is upper-bounded by B

begin
    T ← TreeCover(G)
    I_T ← AssignTreeIntervals(T)
    H ← InitializeMinHeap()
    s ← 0                                      ▷ number of currently assigned intervals
    for i = n to 1 do                          ▷ visit nodes in reverse topological order
        {v} ← {v ∈ V | τ(v) = i}
        I'(v) ← {I_T(v)}
        foreach w ∈ N⁺(v) do
            I'(v) ← I'(v) ⊕ I'(w)              ▷ merge interval sets
        I'(v) ← ck-IntervalCover(I'(v))        ▷ replace intervals by ck-interval cover
        s ← s + |I'(v)|
        if |I'(v)| > k then
            Heap-Push(H, v, |N⁺(v)|)
        while s > B do
            w ← Heap-Pop(H)
            s ← s - |I'(w)| + k
            I'(w) ← k-IntervalCover(I'(w))
    return {I'(v) | v ∈ V}

In the next section, we provide more details about our query answering algorithm and additional
heuristics that further speed up query processing over the FERRARI index.



5 Query Processing and Additional Heuristics 

The basic query processing over FERRARI's reachability intervals is straightforward and resembles
the basic approach of Agrawal et al. [1]: For every node v, the intervals in the set I'(v) are
maintained in sorted order. Then, given a reachability query (G, s, t), it can be determined in
O(log |I'(s)|) time whether the target id π(t) is contained in one of the intervals of the source.
The query returns a negative answer (s ↛ t) if the target id lies outside all of the intervals and
a positive answer if it is contained in one of the exact intervals. Finally, if π(t) falls into one
of the approximate intervals, the neighbors of s are expanded recursively using a depth-first
search.
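The three cases of the basic query procedure map onto a binary search followed by an optional recursion. The sketch below assumes that each label is a sorted list of disjoint (lo, hi, exact) triples and that pi[v] holds the id of node v; these data layouts and the function name are assumptions for illustration.

```python
from bisect import bisect_right


def reachable(s, t, pi, labels, succ):
    """Answer the query 's reaches t' using interval labels plus guided search."""
    if s == t:
        return True
    intervals = labels[s]
    # Binary search for the last interval that starts at or before pi[t].
    idx = bisect_right(intervals, (pi[t], float("inf"), True)) - 1
    if idx < 0 or intervals[idx][1] < pi[t]:
        return False   # target id lies outside all intervals
    if intervals[idx][2]:
        return True    # target id lies inside an exact interval
    # Approximate hit: expand the successors of s recursively (depth-first).
    return any(reachable(w, t, pi, labels, succ) for w in succ[s])
```

The recursion terminates because the graph is acyclic; for very deep graphs an explicit stack would avoid Python's recursion limit.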

Next, we introduce some heuristics that can further speed up query processing. 



5.1 Seed Based Pruning 

It is evident that, in the case of recursive querying, the performance of the algorithm depends 
on the number of vertices that have to be expanded during the online search. Nodes with a very 
high outdegree are especially costly as they might lead to a large number of recursive queries. 
In practice, such high degree nodes are to be expected due to the fact that (i) most of the real- 
world graphs in our target applications will follow a power-law degree distribution and (ii) the 
condensation graph obtained from the input graph produces high-degree nodes in many cases 
because the large strongly connected components usually exhibit a large number of outgoing 
edges. 

To overcome this problem, we propose to determine a set of seed vertices S ⊆ V and assign an
additional label to every node v in the graph, indicating for every σ ∈ S whether G contains a
forward (backward) directed path from v to σ.

This labeling scheme works as follows: Every node is associated with two sets, S⁻(v) and
S⁺(v), such that S⁻(v) := {σ ∈ S | σ ⇝ v} and S⁺(v) := {σ ∈ S | v ⇝ σ}. Next, we describe
the procedure for assigning the sets S⁺:

For every node v, we initialize S⁺(v) = {v} if v ∈ S and S⁺(v) = ∅ otherwise. We then maintain
a FIFO queue of vertices, initialized to contain all leaves of the graph. At each step of the
algorithm, the first vertex v is removed from the queue. Then, for every predecessor u with
(u, v) ∈ E, we set S⁺(u) ← S⁺(u) ∪ S⁺(v). If all successors of u have been processed, u itself is
added to the end of the queue. The algorithm continues until all nodes of the graph have been
labeled. It is easy to see that the above procedure can be implemented efficiently. The approach
for assigning the sets S⁻ is similar (starting from the root nodes).
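The propagation of the sets S⁺ can be sketched with one bitmask per node, which makes the set union a single bitwise OR. The representation (successor lists, one bit per seed) and the function name are assumptions made for this illustration.

```python
from collections import deque


def forward_seed_sets(succ, seeds):
    """Return S+ as bitmasks: bit i of result[v] is set iff v reaches seeds[i]."""
    n = len(succ)
    pred = [[] for _ in range(n)]
    for u in range(n):
        for v in succ[u]:
            pred[v].append(u)
    reach = [0] * n
    for i, sigma in enumerate(seeds):
        reach[sigma] |= 1 << i                    # every seed reaches itself
    pending = [len(succ[v]) for v in range(n)]    # successors not yet processed
    queue = deque(v for v in range(n) if pending[v] == 0)   # leaves
    while queue:
        v = queue.popleft()
        for u in pred[v]:
            reach[u] |= reach[v]                  # S+(u) <- S+(u) ∪ S+(v)
            pending[u] -= 1
            if pending[u] == 0:                   # all successors of u are done
                queue.append(u)
    return reach
```

Running the same routine on the reversed graph yields the sets S⁻.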

Once assigned, the sets can be used by the query processing algorithm in the following way: For
a query (G, s, t),

1. if S⁺(s) ∩ S⁻(t) ≠ ∅, then s ⇝ t.

2. if there exists a seed node σ s.t. σ ∈ S⁻(s) and σ ∉ S⁻(t), that is, the seed σ can reach s
but not t, the query can be terminated with a negative answer (s ↛ t).

In our implementation we elect the s nodes with maximum degree as seed nodes
(requiring a minimum degree of 1). The number of seeds s can be specified prior to index
construction; in our experiments we set s = 32.
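With the seed sets stored as bitmasks, both rules reduce to a few bitwise operations. The following sketch assumes that s_plus[v] and s_minus[v] are integers with one bit per seed; the function name and the three-valued return convention are hypothetical.

```python
def seed_filter(s, t, s_plus, s_minus):
    """Return True or False if the seed labels decide the query, else None."""
    if s_plus[s] & s_minus[t]:
        return True    # rule 1: some seed lies on a path from s to t
    if s_minus[s] & ~s_minus[t]:
        return False   # rule 2: a seed reaches s but not t
    return None        # undecided: continue with intervals and online search
```

With 32 seeds, each set fits into a single machine word, so both checks run in constant time.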

5.2 Pruning Based on Topological Properties 

We enhance the FERRARI index with two additional powerful criteria that allow additional
pruning of certain queries. First, we adopt the effective topological level filter that was proposed
by Yildirim et al. for the GRAIL index (see [32] for details). Second, we maintain the topological
order number τ(v) of each vertex v for pruning: since τ(u) < τ(v) holds for every edge (u, v), a
query (G, s, t) with s ≠ t and τ(s) ≥ τ(t) can be answered negatively right away.
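Both criteria are constant-time comparisons at query time. The sketch below uses one possible definition of a topological level (0 for sinks, otherwise one more than the maximum level of the successors); this definition is an assumption for illustration, the exact formulation used by GRAIL is given in [32]. The function names are hypothetical.

```python
def topological_levels(succ, tau):
    """level[v] = 0 for sinks, else 1 + max level over the successors of v."""
    level = [0] * len(succ)
    # Reverse topological order guarantees that successors are finished first.
    for v in sorted(range(len(succ)), key=lambda x: -tau[x]):
        for w in succ[v]:
            level[v] = max(level[v], level[w] + 1)
    return level


def pruned_by_topology(s, t, tau, level):
    """True if s != t and the filters prove that s cannot reach t."""
    if s == t:
        return False
    # A path from s to t strictly increases tau and strictly decreases level.
    return tau[s] >= tau[t] or level[s] <= level[t]
```

A result of False only means that the filters are inconclusive; the query then proceeds with the interval labels.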

Before we proceed to the experimental evaluation of our index, we first give an overview of
previously proposed reachability indexing approaches.

6 Related Work 

Due to the crucial role played by reachability queries in innumerable applications, indices to 
speed them up have been subject of active research. Instead of exhaustively surveying previous 
results, we briefly describe some of the key proposals here. For a detailed survey, we direct the 
reader to [20] . In this section, we distinguish between reachability query processing techniques 
that are able to answer queries using only the label information on nodes specified in the query, 
and those which use the index to speed up guided online search over the graph. 

Before we proceed, it is worth noting that there are two recent proposals that aim to speed
up reachability queries from a different direction compared to the standard graph indexing
approaches. First, in [10], the authors propose a novel way to compact the graph before applying
any reachability index. Naturally, this technique can be used in conjunction with FERRARI,
hence we consider it orthogonal to the focus of the present paper. The other proposal is to
compress the transitive closure through a carefully optimized word-aligned bitmap encoding of
the intervals [18]. The resulting encoding, called PWAH-8, is shown to make the interval labeling
technique of Nuutila [14] scale to larger graph datasets. In our experiments, we compare our
performance with both Nuutila's Intervals assignment technique and the PWAH-8 variant.

6.1 Direct Reachability Indices

Indices in this category answer a reachability query (G, s, t) using just the labels assigned to s
and t. Apart from the classical algorithm [1] described in Section 2, another approach based on
tree covering [15] focused on sparse graphs (targeting near-tree structures in XML databases),
labeling each node with its tree interval and computing the transitive closure of non-tree edges
separately.

Apart from trees to cover the graph being indexed, alternative simple structures such as
chains and paths have also been used. In a chain covering of the graph, a node u can reach v if
both belong to the same chain and u precedes v. In [5] an optimal way to cover the graph with
chains in O(n³) was proposed, later reduced to O(n² + dn√d), where d denotes the diameter
of the graph [3]. Although chain covers typically generate smaller index sizes than the interval
labeling and can answer queries efficiently, they are very expensive to build for large graphs.
The recently proposed PathTree index [11] combines tree covering and path covering effectively
to build an index that allows for extremely fast reachability query processing. Unfortunately,
the index size can be extremely large, consuming up to O(np) space, where p denotes the number
of paths in the decomposition.

Instead of indexing using covering structures, Cohen et al. [5] introduced 2-Hop labeling
which, at each node u, maintains a subset of the node's ancestors and descendants. Using this,
reachability queries between s and t can be answered by intersecting the descendant set of s
with the ancestor set of t. This technique was particularly attractive for query processing within
a database system since it can be implemented efficiently using SQL statements performing set
intersections [16]. The main hurdle in using it for large graphs turns out to be its construction:
optimally selecting the subsets to label nodes with is an NP-hard problem, and no bounds on
the index size can be specified. HOPI indexing [16] tried to overcome these issues by clever
engineering, using a divide-and-conquer approach for computing the covering. 3-Hop labeling [12]
combines the idea of chain covering with the 2-Hop strategy to reduce the index size.
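The query side of a 2-Hop style labeling is a plain set intersection. The sketch below only illustrates that test: the labels are assumed to be given (every node is assumed to appear in its own two sets), the label construction, which is the hard part, is not shown, and the names are hypothetical.

```python
def two_hop_reachable(s, t, l_out, l_in):
    """s reaches t iff a hop node lies in both the out-label of s and the in-label of t."""
    return s == t or not l_out[s].isdisjoint(l_in[t])


# Example labels for the chain 0 -> 1 -> 2 with node 1 acting as hop node.
L_OUT = {0: {0, 1}, 1: {1}, 2: {2}}
L_IN = {0: {0}, 1: {1}, 2: {1, 2}}
```

In this example, the pair (0, 2) is answered through the shared hop node 1, although neither node stores the other one explicitly.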

6.2 Accelerating Online Search 

From the discussion above, it is evident that accurately capturing the entire transitive closure
in a manner that scales to massive graphs remains a major challenge. Some of the recent
approaches have taken a different path and use scalable indices to speed up
traditional online search to answer reachability queries. In GRIPP [17], the index maintains
only one interval per node on the tree cover of the graph, but some nodes reachable through
non-tree edges are replicated to improve the coverage.

The recently proposed GRAIL index [31], [22] uses k random trees to cover the condensed
graph, generating as many intervals to label each node with. As we already described in
Section 3, the query processing proceeds by using the labels to quickly determine non-reachability,
otherwise recursively querying the nodes underneath in the DAG, resulting in a worst-case query
processing performance of O(k(m + n)). Although GRAIL was shown to be able to build indices
over massive-scale graphs quite efficiently, it suffers from the previously discussed drawbacks.
In our experiments, we compare various aspects of our FERRARI index against GRAIL which
is, until now, the only technique that deals effectively with massive graphs while satisfying a
user-specified size constraint.

7 Experimental Evaluation 

We conducted an extensive set of experiments in order to evaluate the performance of FERRARI
in comparison with the state-of-the-art reachability indexing approaches, selected based on
recent results. In this paper, we present the results of our comparison with GRAIL [22],
PathTree [11], Nuutila's Intervals [18], and PWAH-8 [18]. For all the competing methods, we
obtained the original source code from the authors, and set the parameters as suggested in the
corresponding publications.

7.1 Setup 

Fortunately, all indexing methods are implemented in C++, making the comparisons fairly
accurate without any platform-specific artifacts. All experiments were conducted on a Lenovo
ThinkPad W520 notebook computer equipped with 8 Intel Core i7 CPUs at 2.30 GHz and 16
gigabytes of main memory. The operating system in use was a 64-bit installation of Linux Mint
12 using kernel 3.0.0.22.

7.2 Methodology 

The metrics we compare on are: 

1. Construction time for each indexing strategy over each dataset. Since the input to
all considered algorithms is always a DAG, we do not include the time for computing the
condensation graph in our measurements.

2. Query processing time for executing 100,000 reachability queries. We consider random 
and positive sets of queries and report numbers for both workloads separately. 

3. Index size in memory that each index consumes. It should be noted that although both 
FERRARI and GRAIL take as input a size restriction parameter, the resulting size of 
the index can be quite different. PathTree, (Nuutila's) Intervals and PWAH-8 have no 
parameterized size, and depend entirely on the dataset characteristics. 

7.3 Datasets 

We used the selection of graph datasets (Table 2(a)) that, over the recent years, has become the
benchmark set for reachability indexing work. These graphs are classified based on whether
they are small (with tens to hundreds of thousands of nodes and edges) or large (with millions of
nodes and edges), and dense or sparse. Due to lack of space, we refer to the detailed description of
these datasets in [22] and [11] and, for the same reason, report results only for a salient subset
(the full set of results can be found in the appendix of this paper). We used the version of the
files provided at http://code.google.com/p/grail/. We term these datasets
benchmark datasets and present results accordingly in Section 7.4.

In order to evaluate the performance of the algorithms under real-world settings, where
massive-scale graphs are encountered, we use additional datasets derived from publicly available
sources. These include RDF data, an online social network, and a World Wide Web crawl. To
the best of our knowledge, these constitute some of the largest graphs used in evaluating the
effectiveness of reachability indices to date. In the following, we briefly describe each of
them, and summarize the key characteristics of these datasets in Table 2(b).

• GovWild is a large RDF data collection consisting of about 26 million triples representing
relations between more than 8 million entities
(http://govwild.hpi-web.de/project/govwild-project.html).

• Yago2 is another large-scale RDF dataset representing an automatically constructed
knowledge graph [5]. The version we used contained close to 33 million edges (facts)
between 16.3 million nodes (entities) (http://www.mpi-inf.mpg.de/yago-naga/yago/).

• The Twitter graph [3] is a representative of a large-scale social network. This graph,
obtained from a crawl of twitter.com, represents the follower relationship between about
50 million users.

• Web-UK is an example of a web graph dataset [2]. This graph contains about 133 million
nodes (hosts) and 5.5 billion edges (hyperlinks).

We present the results of our evaluation over these web-scale graphs in Section 7.5.

Table 2: Datasets Used

(a) Benchmark Datasets

Dataset       Type                    |V|           |E|
ArXiV         small, dense          6,000        66,707
GO            small, dense          6,793        13,361
Pubmed        small, dense          9,000        40,028
Human         small, sparse        38,811        39,816
CiteSeer      large               693,947       312,282
Cit-Patents   large             3,774,768    16,518,947
CiteSeerX     large             6,540,401    15,011,260
GO-Uniprot    large             6,967,956    34,770,235

(b) Web Datasets

Dataset               |V|              |E|         |V_C|         |E_C|
GovWild         8,027,601       26,072,221     8,022,880    23,652,610
YAGO2          16,375,503       32,962,379    16,375,503    25,908,132
Twitter        54,981,152    1,963,263,821    18,121,168    18,359,487
Web-UK        133,633,040    5,507,679,822    22,753,644    38,184,039

7.4 Results over Benchmark Graphs

Tables 3(a)-(d) and the charts in Figures 2-5 summarize the results for the selected set of
benchmark graphs. In the tables, we provide the absolute values (time in milliseconds and index
size in KBytes), while the figures help to visualize the relative performance of the algorithms
over different datasets. In all the tables, missing values are marked as "-" whenever a dataset
could not be indexed by the corresponding strategy, either due to memory exhaustion or for taking
too long to index (timeout set to 1M milliseconds). The best performing strategy for each dataset is
shown in bold. For GRAIL we set the number of dimensions as suggested in [22], that is, to 2 for
small sparse graphs (Human), 3 for small dense graphs (ArXiV, GO, PubMed) and to 5 for the
remaining large graphs. The input parameter value for FERRARI was also set correspondingly
for a fair comparison.

7.4.1 Index Construction 

Table 3(a) and Figure 2 present the construction time for the various algorithms. The results
show that the GRAIL index can be constructed very efficiently on small graphs, irrespective of
the density of the graph. On the other hand, the performance of PathTree is highly sensitive
to the density of the graph as well as the size. While the indexing time of GRAIL and FERRARI
increases in line with the size of the graphs, PathTree simply failed to complete building
for 3 of the larger graphs: Cit-Patents and CiteSeerX due to memory exhaustion, GO-Uniprot
due to timeout.

The transitive closure compression algorithms Interval and PWAH-8 can index even the large
graphs quite efficiently, and their index is also surprisingly compact. A remarkable exception
to this is the behavior on the Cit-Patents dataset, which seems to be by far the most
difficult graph for reachability indexing. The Interval index failed to process the graph within
the given time limit. The related PWAH-8 algorithm finished the labeling only after around 12
minutes and ended up generating the largest index in all our experiments (including the indices
for the Web graphs). This is rather surprising, as both algorithms were able to index the larger
and denser GO-Uniprot.

Footnote (Web-UK dataset): http://law.di.unimi.it/webdata/uk-union-2006-06-2007-05/

Table 3: Experimental Evaluation on Benchmark Datasets

(a) Construction Time (ms)

Dataset          Ferrari-L     Ferrari-G         Grail      PathTree      Interval        PWAH-8
ArXiV                15.84         26.62          7.86      4,537.39         34.54         70.10
Pubmed               14.28         24.54          8.21        326.54         20.35         44.41
Human                23.36         23.37         15.93        348.48          2.70          3.82
GO                    6.48          6.91          4.83         89.83          5.06          8.67
CiteSeer            450.12        459.90      2,015.90     26,479.70        251.10        416.41
CiteSeerX        14,110.20     16,233.40     20,528.40             -      5,808.79     14,444.09
GO-Uniprot       26,105.90     29,611.90     34,518.40             -     15,213.55     26,745.61
Cit-Patents      20,665.50     32,366.20     21,621.70             -             -    751,984.08

(b) Index Size (Kb)

Dataset          Ferrari-L     Ferrari-G         Grail      PathTree      Interval        PWAH-8
ArXiV               243.86        275.33        304.69        338.07      1,364.99        315.24
Pubmed              283.68        413.06        457.03        419.03      1,523.83        358.96
Human               768.88        770.30      1,364.45        458.01        160.56        160.22
GO                  200.37        251.01        344.96        133.30        180.58         81.86
CiteSeer         13,933.90     13,934.29     56,925.34      9,221.61      7,733.94      6,723.36
CiteSeerX       158,046.72    242,236.08    536,517.27             -    430,913.36    152,354.44
GO-Uniprot      429,564.04    442,301.79    571,590.14             -    774,081.33    249,883.80
Cit-Patents     151,631.73    239,609.23    309,648.94             -             -  5,462,135.76

(c) Query Processing Performance (ms), 100k random queries

Dataset          Ferrari-L     Ferrari-G         Grail      PathTree      Interval        PWAH-8
ArXiV                23.69         13.91        100.92          3.41          4.17         23.22
Pubmed                7.58          4.88         12.27          2.76          3.16         28.58
Human                 0.78          0.78          4.98          1.21          1.07          1.06
GO                    4.10          2.96          4.83          2.04          2.47          4.45
CiteSeer              6.13          6.24          8.05          5.01          8.28         12.39
CiteSeerX            15.88          9.31         41.23             -          9.27         21.32
GO-Uniprot           28.30         28.92          5.94             -         16.82         48.70
Cit-Patents         778.09        502.20        578.83             -             -      1,514.91

(d) Query Processing Performance (ms), 100k positive queries

Dataset          Ferrari-L     Ferrari-G         Grail      PathTree      Interval        PWAH-8
ArXiV                62.64         37.98        220.31          4.94          5.95         17.74
Pubmed               31.31         20.28         85.38          4.42          6.21         43.58
Human                 2.08          1.96         14.48          1.30          1.79          6.07
GO                   10.72          4.64         19.59          2.04          3.26         11.43
CiteSeer             13.37         13.47         85.22          6.12         15.17         30.60
CiteSeerX            82.76         43.06        700.49             -         30.38         69.21
GO-Uniprot           65.00         64.72        131.46             -         31.76         54.55
Cit-Patents       4,086.21      2,667.38      5,409.82             -             -      1,739.30

Figure 2: Index Construction Time
Figure 3: Index Space Consumption
Figure 4: Query Processing Times for 100k Random Queries
Figure 5: Query Processing Times for 100k Positive Queries



When compared to other algorithms, the construction times of FERRARI-L and FERRARI-G
are highly scalable and are not affected much by the variations in the density of the graph.
On all graphs, FERRARI constructs the index quickly, while maintaining a competitive index
size. For the challenging Cit-Patents dataset, it generates the most compact index among all the
techniques considered, and very fast, amounting to a 23x-36x speedup over PWAH-8. Further,
FERRARI consistently generates smaller indices than GRAIL, and exhibits comparable indexing
time. With a few more clever engineering tricks (e.g., including the PWAH-8-style interval
encoding), it should be possible to further reduce the size of FERRARI.
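The stated speedup range can be retraced from the Cit-Patents construction times reported in Table 3(a). Dividing the PWAH-8 time by the FERRARI-G and FERRARI-L times, respectively, gives (values rounded to one decimal):

```latex
\[
  \frac{751{,}984.08\ \text{ms}}{32{,}366.20\ \text{ms}} \approx 23.2,
  \qquad
  \frac{751{,}984.08\ \text{ms}}{20{,}665.50\ \text{ms}} \approx 36.4 .
\]
```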

7.4.2 Query Processing 

Moving on to query processing, we consider random and positive query workloads, with results
depicted in Figures 4 and 5 (Tables 3(c) and 3(d)), respectively. These results highlight
the consistency of FERRARI in being able to process both types of queries efficiently over all
varieties of graphs. Although, for really small graphs, PathTree is the fastest, as
we explained above, it cannot be applied to larger datasets. As graphs get larger, the Interval
index turns out to be the fastest. This is not very surprising, since Interval materializes
the exact transitive closure of the graph. FERRARI-G consistently provides competitive query
processing times for both positive and random queries over all datasets. A remarkable
result of our experimental evaluation is the CiteSeerX dataset: in this setting, the Interval
index consumes almost twice as much space as the corresponding FERRARI-G index, yet is only
faster by 0.04 milliseconds for random and 12.68 milliseconds for positive queries.

7.5 Evaluation over Web Datasets 

As we already pointed out in the introduction, our goal was to develop an index that is both
compact and efficient for use in many analytics tasks when the graphs are of web scale. For this
reason, we have collected graphs that amount to up to 5 billion edges before computing the
condensation graph. These graphs are of utmost importance because the resulting DAG exhibits
special properties absent from previously considered benchmark datasets. In this section, we
present the results of this evaluation. Due to its limited scalability, we do not use the PathTree
index in these experiments. Also, through initial trial experiments, we found that for GRAIL
the suggested parameter value of k = 5 does not appear to be the optimal choice, so instead we
report the results with the setting k = 2, which is also used for FERRARI. The summary of
results is provided in Table 4.

7.5.1 Index Construction 

When we consider the index construction statistics in Tables 4(a) and 4(b), it seems that there is
no single strategy that is superior across the board. However, a careful look at these results
further emphasizes the superiority of FERRARI in terms of its consistent performance. While
GRAIL can be constructed fast, its size can be quite large (e.g., in the case of Twitter). On the
other hand, PWAH-8 can take an order of magnitude more time to construct than FERRARI as
well as GRAIL, as we notice for Web-UK. In fact, the Interval index, which is much smaller than
FERRARI for Twitter and Yago2, fails to complete within the time allotted for the Web-UK
dataset. In contrast, FERRARI and GRAIL are able to handle any form of graph easily in a
scalable manner.

As an additional note, the index size of Interval is sometimes smaller than FERRARI which
seems to be counterintuitive at first glance. The reason for this lies in the additional information
maintained at every node by FERRARI, for use in early pruning heuristics. In relatively sparse
datasets like Twitter and Yago2 the overhead of this extra information tends to outweigh the
gains made by interval merging. If needed, it is possible to turn off these heuristics easily to get
around this problem. However, we retain them across the board to avoid dataset-specific tuning
of the index.

Table 4: Experimental Evaluation on Web Datasets

(a) Construction Time (ms)

Dataset          Ferrari-L     Ferrari-G         Grail      Interval        PWAH-8
YAGO2            27,713.50     26,865.30     17,163.00      5,844.87      9,236.71
GovWild          12,998.80     18,045.30      6,756.67     15,060.55     20,703.06
Twitter          13,065.40     13,897.20      9,717.39     36,480.57      8,219.09
Web-UK           17,604.90     18,754.40     12,275.90             -    166,531.10

(b) Index Size (Kb)

Dataset          Ferrari-L     Ferrari-G         Grail      Interval        PWAH-8
YAGO2           372,150.70    448,139.06    575,701.28    182,962.96    137,878.21
GovWild         206,475.06    297,724.03    282,054.38    921,605.13    311,359.35
Twitter         384,049.21    384,368.44    637,072.31     85,648.12     97,859.81
Web-UK          616,486.63    647,050.45    799,932.80             -    266,342.83

(c) Query Processing Performance (ms), 100k random queries

Dataset          Ferrari-L     Ferrari-G         Grail      Interval        PWAH-8
YAGO2                12.00         10.95         16.56         10.45         12.62
GovWild              60.27         31.77         42.62         13.33         33.30
Twitter               5.55          5.65         19.27          8.66         10.32
Web-UK               19.11         19.29         39.21             -         20.45

(d) Query Processing Performance (ms), 100k positive queries

Dataset          Ferrari-L     Ferrari-G         Grail      Interval        PWAH-8
YAGO2                59.39         38.43         97.99         21.70         44.19
GovWild             171.46         85.12        228.98         29.84        126.96
Twitter              10.24         10.18         76.07         18.21         36.01
Web-UK               25.54         18.01         95.25             -         43.73

7.5.2 Query Processing 

Finally, we turn our attention to the query processing performance over Web-scale datasets. As
the results summarized in Tables 4(c) and 4(d) demonstrate, the FERRARI variants and the Interval
index provide the fastest index structures. For Web-UK and Twitter, the FERRARI variants
outperform all other approaches. The performance of GRAIL (as predicted) and of PWAH-8 is
inferior to that of Interval and FERRARI-G for both random and positive
query loads.
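
To make the totals in Table 4(d) easier to interpret, the following minimal sketch converts them into an average per-query latency and into a speedup factor of FERRARI-G over GRAIL. The timings are copied from Table 4(d); the function and variable names are illustrative assumptions and are not part of the original evaluation code.

```python
# Total time (ms) for 100k positive queries, copied from Table 4(d).
NUM_QUERIES = 100_000

positive_ms = {
    "YAGO2":   {"Ferrari-G": 38.43, "Grail": 97.99},
    "GovWild": {"Ferrari-G": 85.12, "Grail": 228.98},
    "Twitter": {"Ferrari-G": 10.18, "Grail": 76.07},
    "Web-UK":  {"Ferrari-G": 18.01, "Grail": 95.25},
}


def per_query_us(total_ms: float, num_queries: int = NUM_QUERIES) -> float:
    """Average latency of a single query in microseconds."""
    return total_ms * 1000.0 / num_queries


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def summarize(table):
    """Map each dataset to (FERRARI-G latency in us, speedup over GRAIL)."""
    return {
        dataset: (
            per_query_us(times["Ferrari-G"]),
            speedup(times["Grail"], times["Ferrari-G"]),
        )
        for dataset, times in table.items()
    }


if __name__ == "__main__":
    for dataset, (latency, factor) in summarize(positive_ms).items():
        print(f"{dataset}: {latency:.4f} us/query, {factor:.2f}x faster than Grail")
```

On Twitter, for instance, 10.18 ms for 100k positive queries corresponds to roughly 0.1 microseconds per query on average.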

In summary, our experimental results indicate that FERRARI, in particular the global
budgeted variant, is highly scalable and consistent in being able to index a wide variety of graphs,
and can answer queries extremely fast.⁷ This, we believe, provides a compelling reason to use
FERRARI-G on a wide spectrum of graph analytics applications involving large to massive-scale
graphs.

8 Conclusions & Outlook

In this paper, we presented an efficient and scalable reachability index structure, FERRARI, that
allows direct control over the tradeoff between query processing time and space consumption via
a user-specified restriction on the resulting index size. The two variants of our index allow the
user either to specify the maximum size of the resulting node labels, or to impose a global size
constraint, which enables the dynamic allocation of budgets based on the importance of individual nodes.
Using a theoretically sound technique, FERRARI assigns a mixture of exact and approximate
identifier ranges to nodes so as to speed up both random and positive reachability queries.
Using an extensive array of experiments, we demonstrated that the resulting index scales to
massive-size graphs, even when some of the state-of-the-art indices fail to complete
construction. FERRARI provides very fast query execution, demonstrating substantial
gains in processing time for both random and positive queries when compared to the previous
state-of-the-art method, GRAIL.
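
The interplay of exact and approximate identifier ranges with recursive querying can be illustrated with a minimal sketch. It is not the authors' implementation: the names `find_interval` and `reachable`, the label layout, and the hand-built toy graph are all assumptions made for illustration. The sketch assumes a DAG, labels that are sorted and non-overlapping, and approximate ranges that may cover unreachable identifiers but never miss a reachable one.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# (start, end, exact): inclusive range of node identifiers.
# exact=False marks an approximate range that may contain false positives.
Interval = Tuple[int, int, bool]


def find_interval(label: List[Interval], target_id: int) -> Optional[Interval]:
    """Binary-search a label (sorted by start) for the range covering target_id."""
    starts = [start for start, _, _ in label]
    pos = bisect_right(starts, target_id) - 1
    if pos >= 0 and label[pos][1] >= target_id:
        return label[pos]
    return None


def reachable(graph: Dict[str, List[str]], labels: Dict[str, List[Interval]],
              ids: Dict[str, int], source: str, target: str) -> bool:
    """Answer a reachability query on a DAG using range labels."""
    if source == target:
        return True
    hit = find_interval(labels[source], ids[target])
    if hit is None:
        return False          # no range covers the target: not reachable
    if hit[2]:
        return True           # exact range: reachable without further probing
    # Approximate range: the answer is uncertain, so probe the successors.
    return any(reachable(graph, labels, ids, nxt, target) for nxt in graph[source])


# Hypothetical toy DAG with hand-assigned identifiers and labels.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["e", "f"], "d": ["f"], "e": [], "f": []}
ids = {"d": 1, "b": 2, "e": 3, "f": 4, "c": 5, "a": 6}
labels = {
    "a": [(1, 6, True)],
    "b": [(1, 4, False)],   # exact ranges [1,2] and [4,4] merged under a budget of one
    "c": [(3, 5, True)],
    "d": [(1, 4, False)],   # exact ranges [1,1] and [4,4] merged under a budget of one
    "e": [(3, 3, True)],
    "f": [(4, 4, True)],
}
```

In this toy setting, `reachable(graph, labels, ids, "a", "f")` is answered by a single exact probe, whereas `reachable(graph, labels, ids, "d", "e")` first hits the approximate range of `d` and is only rejected after probing the successor `f`.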

Results presented in this paper open up a range of possible future directions that we plan 
to pursue. First of all, we would like to integrate FERRARI into a number of graph analytics
algorithms, ranging from shortest path computation to more complex graph mining techniques
(e.g., Steiner trees), in order to study its impact on the performance of these expensive operations.
We also would like to pursue further optimizations of the FERRARI index by, for instance, the 
use of an improved tree covering algorithm designed to work synergistically with FERRARI to 
generate high quality intervals to begin with, and the use of compression techniques for interval 
encoding afterwards. 

A Additional Experimental Results 

A.1 Benchmark Datasets

The complete results for our experimental evaluation over different size constraints are depicted
in Tables 5 and 6.

7 In a light-hearted view, our index shares the characteristics of its namesake F-1 racing car, in the sense that
although not being the most powerful in the fray, it consistently tends to win on all circuits under diverse input
conditions.

                      --------------- Ferrari-L ----------------          --------------- Ferrari-G ----------------          ----------------- Grail ------------------
Dataset               k=1          k=2          k=3          k=5          k=1          k=2          k=3          k=5          d=1          d=2          d=3          d=5     PathTree     Interval       PWAH-8
ArXiV               33.66        26.61        23.69        19.50        30.90        20.73        13.91         8.78       180.26       122.85       100.92        75.41         3.41         4.17        23.22
Pubmed              10.20         8.23         7.58         6.05         7.66         5.77         4.88         3.73        20.64        13.87        12.27        10.55         2.76         3.16        28.58
Human                0.78         0.78         0.76         0.77         0.77         0.78         0.78         0.77         4.80         4.98         5.04         5.29         1.21         1.07         1.06
GO                   6.89         4.45         4.10         3.51         4.28         3.41         2.96         2.85        10.33         6.46         4.83         4.13         2.04         2.47         4.45
CiteSeer             8.02         6.57         6.40         6.13         6.14         6.18         6.15         6.24         8.54         8.28         8.19         8.05         5.01         8.28        12.39
CiteSeerX           20.71        17.90        17.04        15.88        16.09        12.62        10.95         9.31        63.06        50.02        46.49        41.23            -         9.27        21.32
GO-Uniprot          32.45        28.92        29.69        28.30        32.27        29.36        28.79        28.92         6.51         6.01         6.01         5.94            -        16.82        48.70
Cit-Patents      1,100.32       858.15       798.86       778.09       900.56       722.61       672.76       502.20     2,545.37     1,123.68       851.24       578.83            -            -     1,514.91

(a) Query Processing Performance (ms), 100k random queries

                      --------------- Ferrari-L ----------------          --------------- Ferrari-G ----------------          ----------------- Grail ------------------
Dataset               k=1          k=2          k=3          k=5          k=1          k=2          k=3          k=5          d=1          d=2          d=3          d=5     PathTree     Interval       PWAH-8
ArXiV               86.89        72.11        62.64        52.77        84.12        56.85        37.98        21.30       398.56       275.74       220.31       172.43         4.94         5.95        17.74
Pubmed              39.17        33.38        31.31        25.83        31.20        24.26        20.28        12.09       123.42        91.34        85.38        79.78         4.42         6.21        43.58
Human                2.48         2.08         2.04         2.02         2.03         1.96         2.02         2.04        14.16        14.48        15.22        18.25         1.30         1.79         6.07
GO                  12.89        10.70        10.72         6.92        10.05         5.80         4.64         4.25        24.77        20.06        19.59        19.84         2.04         3.26        11.43
CiteSeer            25.23        18.96        14.33        13.37        13.52        13.33        13.51        13.47        72.81        77.57        83.83        85.22         6.12        15.17        30.60
CiteSeerX          106.56        96.52       102.09        82.76        86.80        64.20        53.04        43.06       882.49       731.03       741.47       700.49            -        30.38        69.21
GO-Uniprot          68.90        71.62        66.51        65.00        71.70        73.24        67.69        64.72       135.06       125.52       126.90       131.46            -        31.76        54.55
Cit-Patents      6,581.21     4,836.69     4,443.29     4,086.21     5,210.22     4,016.88     3,621.21     2,667.38    15,743.66     9,865.95     7,712.84     5,409.82            -            -     1,739.30

(b) Query Processing Performance (ms), 100k positive queries

Table 5: Experimental Evaluation on Benchmark Datasets - Query Processing (Varying Size Constraint)



                      --------------- Ferrari-L ----------------          --------------- Ferrari-G ----------------          ----------------- Grail ------------------
Dataset               k=1          k=2          k=3          k=5          k=1          k=2          k=3          k=5          d=1          d=2          d=3          d=5     PathTree     Interval       PWAH-8
ArXiV              164.44       205.17       243.86       312.14       169.91       222.63       275.33       380.86       117.19       210.94       304.69       492.19       338.07     1,364.99       315.24
Pubmed             213.56       250.32       283.68       340.43       254.88       333.98       413.06       571.15       175.78       316.41       457.03       738.28       419.03     1,523.83       358.96
Human              764.38       768.88       769.61       769.95       769.85       770.30       770.75       770.78       758.03     1,364.45     1,970.87     3,183.71       458.01       160.56       160.22
GO                 165.25       186.05       200.37       217.45       192.36       231.57       251.01       265.21       132.68       238.82       344.96       557.24       133.30       180.58        81.86
CiteSeer        10,876.86    13,220.50    13,844.14    13,933.90    13,927.15    13,934.29    13,934.29    13,934.29    13,553.65    24,396.57    35,239.50    56,925.34     9,221.61     7,733.94     6,723.36
CiteSeerX      134,770.72   140,793.14   146,975.22   158,046.72   152,661.80   187,058.08   209,991.64   242,236.08   127,742.21   229,935.97   332,129.74   536,517.27            -   430,913.36   152,354.44
GO-Uniprot     197,329.62   258,566.04   318,769.95   429,564.04   197,334.69   258,576.47   319,818.29   442,301.79   136,092.89   244,967.20   353,841.52   571,590.14            -   774,081.33   249,883.80
Cit-Patents     92,089.32   105,895.74   121,532.11   151,631.73   106,902.60   140,079.25   173,255.88   239,609.23    73,725.94   132,706.69   191,687.44   309,648.94            -            - 5,462,135.76

(a) Index Size (Kb)

                      --------------- Ferrari-L ----------------          --------------- Ferrari-G ----------------          ----------------- Grail ------------------
Dataset               k=1          k=2          k=3          k=5          k=1          k=2          k=3          k=5          d=1          d=2          d=3          d=5     PathTree     Interval       PWAH-8
ArXiV                7.72        11.86        15.84        19.49        14.86        22.15        26.62        33.50         2.59         4.57         7.86        14.03     4,537.39        34.54        70.10
Pubmed              10.56        13.09        14.28        17.67        16.42        23.31        24.54        27.11         3.00         5.21         8.21        14.13       326.54        20.35        44.41
Human               23.71        23.36        22.77        23.97        23.89        23.37        22.90        23.33        10.08        15.93        32.72        64.36       348.48         2.70         3.82
GO                   5.34         6.03         6.48         6.62         6.74         6.72         6.91         6.94         1.93         3.23         4.83         7.77        89.83         5.06         8.67
CiteSeer           483.24       478.34       455.58       450.12       477.39       463.38       459.42       459.90       273.73       487.09     1,037.21     2,015.90     26,479.7       251.10       416.41
CiteSeerX       12,869.70    13,336.30    13,748.50    14,110.20    13,978.20    14,970.80    15,415.50    16,233.40     3,397.65     6,638.96    11,436.70    20,528.40            -     5,808.79    14,444.09
GO-Uniprot       9,173.19    15,526.80    19,205.90    26,105.90    15,495.80    23,040.90    26,817.10    29,611.90     2,841.04     5,131.31    14,764.30    34,518.40            -    15,213.55    26,745.61
Cit-Patents     16,085.90    17,418.30    18,649.30    20,665.50    20,211.80    23,861.60    27,179.20    32,366.20     3,842.96     7,516.79    12,123.50    21,621.70            -            -   751,984.08

(b) Construction Time (ms)

Table 6: Experimental Evaluation on Benchmark Datasets - Indexing (Varying Size Constraint)



A.2 Web Datasets 

The complete results for our experimental evaluation over different size constraints are depicted
in Tables 7 and 8.

                      --------------- Ferrari-L ----------------          --------------- Ferrari-G ----------------          ----------------- Grail ------------------
Dataset               k=1          k=2          k=5         k=10          k=1          k=2          k=5         k=10          d=1          d=2          d=5         d=10     Interval       PWAH-8
YAGO2               14.82        12.00        11.19        10.98        11.13        10.95        10.73        10.78        30.53        16.56        16.66        16.80        10.45        12.62
GovWild            102.29        60.27        51.70        36.62        78.64        31.77        21.91        20.38        76.04        42.62        33.53        32.98        13.33        33.30
Twitter              6.12         5.55         5.58         5.68         5.60         5.65         5.69         5.71        18.76        19.27        19.96        21.11         8.66        10.32
Web-UK              22.74        19.11        19.25        19.44        19.15        19.29        18.88        18.92        42.24        39.21        40.53        43.00            -        20.45

(a) Query Processing Performance (ms), 100k random queries

                      --------------- Ferrari-L ----------------          --------------- Ferrari-G ----------------          ----------------- Grail ------------------
Dataset               k=1          k=2          k=5         k=10          k=1          k=2          k=5         k=10          d=1          d=2          d=5         d=10     Interval       PWAH-8
YAGO2               73.65        59.39        44.67        38.83        48.35        38.43        38.82        38.07       107.33        97.99       101.97       107.55        21.70        44.19
GovWild            293.99       171.46       124.11        75.04       166.02        85.12        45.72        42.51       248.24       228.98       165.33       167.25        29.84       126.96
Twitter             11.10        10.24         9.80        11.63        10.34        10.18        10.18        10.08   248,494.31        76.07        83.57        96.13        18.21        36.01
Web-UK              28.45        25.54        22.82        18.22        23.64        18.01        17.49        16.85    26,792.72        95.25       102.59       113.83            -        43.73

(b) Query Processing Performance (ms), 100k positive queries

Table 7: Experimental Evaluation on Web Datasets - Query Processing (Varying Size Constraint)



                      --------------- Ferrari-L ----------------          --------------- Ferrari-G ----------------          ----------------- Grail ------------------
Dataset               k=1          k=2          k=5         k=10          k=1          k=2          k=5         k=10          d=1          d=2          d=5         d=10     Interval       PWAH-8
YAGO2          346,225.04   372,150.70   417,050.40   444,546.26   405,498.36   448,139.06   453,340.72   453,560.94   319,834.04   575,701.28 1,343,302.98            -   182,962.96   137,878.21
GovWild        181,599.94   206,475.06   269,511.88   343,819.93   227,210.46   297,724.03   475,154.42   511,974.72   156,696.88   282,054.38   658,126.88 1,284,914.38   921,605.13   311,359.35
Twitter        369,201.71   384,049.21   384,358.04   384,367.92   384,348.90   384,368.44   384,368.70   384,368.70   353,929.06   637,072.31 1,486,502.06            -    85,648.12    97,859.81
Web-UK         533,427.01   616,486.63   635,946.11   644,624.98   633,044.02   647,050.45   659,961.81   669,742.03   444,407.11   799,932.80 1,866,509.86            -            -   266,342.83

(a) Index Size (Kb)

                      --------------- Ferrari-L ----------------          --------------- Ferrari-G ----------------          ----------------- Grail ------------------
Dataset               k=1          k=2          k=5         k=10          k=1          k=2          k=5         k=10          d=1          d=2          d=5         d=10     Interval       PWAH-8
YAGO2           27,475.10    27,713.50    27,679.40    26,778.80    28,169.40    26,865.30    26,614.70    26,511.40     9,046.63    17,163.00    59,587.00   123,328.00     5,844.87     9,236.71
GovWild         12,025.40    12,998.80    15,824.30    16,565.50    15,487.20    18,045.30    14,874.80    14,941.60     3,535.81     6,756.67    24,482.30    50,129.00    15,060.55    20,703.06
Twitter         13,434.70    13,065.40    13,403.20    13,768.00    13,442.50    13,897.20    13,877.70    13,885.50     5,784.66     9,717.39    70,038.80   161,704.00    36,480.57     8,219.09
Web-UK          18,145.00    17,604.90    18,029.70    18,506.30    18,463.30    18,754.40    19,056.40    19,202.70     7,273.53    12,275.90    82,185.90   181,947.00            -   166,531.10

(b) Construction Time (ms)

Table 8: Experimental Evaluation on Web Datasets - Indexing (Varying Size Constraint)